Index Skip Scan (new UniqueKeys)
Hi,
Here is a new version of the index skip scan patch, based on the v8 patch for the UniqueKeys implementation from [1]. I want to start a new thread to simplify navigation; hopefully I didn't forget anyone who actively participated in the discussion.
To simplify reviewing I've split it into several parts:
* The first two are taken from [1], just for reference and to make cfbot happy.
* Extend-UniqueKeys contains the changes that need to be made to
UniqueKeys for use in skip scan. Essentially this is a reduced
version of the previous implementation from Jesper & David, based on the
new UniqueKeys infrastructure, as it does the very same thing.
* Index-Skip-Scan contains the non-AM-specific code and the overall
infrastructure, including configuration elements and so on.
* Btree-implementation contains the btree-specific code implementing amskip,
introduced in the previous patch.
* The last one contains just documentation bits.
Interestingly enough, with the new UniqueKeys implementation skipping is
unexpectedly applied in some tests. For those I've disabled
indexskipscan to avoid confusion.
[1]: /messages/by-id/CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com
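For readers following along, the core idea can be sketched in a few lines of Python (a toy model of a btree as a sorted list of tuples, not the patch's actual C implementation): instead of reading every tuple, the scan binary-searches past each distinct prefix, much as the amskip machinery does with _bt_binsrch.

```python
from bisect import bisect_right

def skip_scan_distinct(index, prefix_len):
    """Return the first tuple for each distinct prefix of a sorted
    'index' (a toy btree). Each iteration jumps, via binary search,
    past every tuple sharing the current prefix."""
    out, pos = [], 0
    while pos < len(index):
        tup = index[pos]
        out.append(tup)
        # smallest key strictly greater than any tuple with this prefix
        upper = tup[:prefix_len] + (float("inf"),) * (len(tup) - prefix_len)
        pos = bisect_right(index, upper, pos)
    return out

index = sorted((a, b) for a in range(3) for b in range(1000))
print(skip_scan_distinct(index, 1))  # [(0, 0), (1, 0), (2, 0)]
```

Here a scan of 3000 tuples is answered with three probes, which is where the benefit for DISTINCT queries over a low-cardinality prefix comes from.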
Attachments:
* v8-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
* v8-0002-Introuduce-RelOptInfo.uniquekeys-attribute.patch
* v35-0001-Extend-UniqueKeys.patch
* v35-0002-Index-skip-scan.patch
* v35-0003-Btree-implementation-of-skipping.patch
* v35-0004-Index-skip-scan-documentation.patch
On Tue, Jun 9, 2020 at 6:20 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Hi,
Here is a new version of the index skip scan patch, based on the v8 patch for the UniqueKeys implementation from [1]. I want to start a new thread to simplify navigation; hopefully I didn't forget anyone who actively participated in the discussion.
To simplify reviewing I've split it into several parts:
* The first two are taken from [1], just for reference and to make cfbot happy.
* Extend-UniqueKeys contains the changes that need to be made to UniqueKeys for use in skip scan. Essentially this is a reduced version of the previous implementation from Jesper & David, based on the new UniqueKeys infrastructure, as it does the very same thing.
* Index-Skip-Scan contains the non-AM-specific code and the overall infrastructure, including configuration elements and so on.
* Btree-implementation contains the btree-specific code implementing amskip, introduced in the previous patch.
* The last one contains just documentation bits.
Interestingly enough, with the new UniqueKeys implementation skipping is unexpectedly applied in some tests. For those I've disabled indexskipscan to avoid confusion.
[1]:
/messages/by-id/CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com
Thanks for the patch.
I just got a rough idea of the patch; it looks like we have to narrow down the
cases where we can use this method. Consider the example below:
create table j1(i int, im5 int, im100 int, im1000 int);
insert into j1 select i, i%5, i%100, i%1000 from generate_series(1,
10000000)i;
create index on j1(im5, i);
insert into j1 values (1, 1, 0, 0);
analyze j1;
demo=# select distinct * from j1 where i < 2;
i | im5 | im100 | im1000
---+-----+-------+--------
1 | 1 | 1 | 1
(1 row)
demo=# set enable_indexskipscan to off;
SET
demo=# select distinct * from j1 where i < 2;
i | im5 | im100 | im1000
---+-----+-------+--------
1 | 1 | 0 | 0
1 | 1 | 1 | 1
(2 rows)
drop index j1_im5_i_idx;
create index on j1(im5, i, im100);
demo=# select distinct im5, i, im100 from j1 where i < 2;
im5 | i | im100
-----+---+-------
1 | 1 | 0
1 | 1 | 1
(2 rows)
demo=# set enable_indexskipscan to on;
SET
demo=# select distinct im5, i, im100 from j1 where i < 2;
im5 | i | im100
-----+---+-------
1 | 1 | 0
(1 row)
--
Best Regards
Andy Fan
On Thu, Jun 11, 2020 at 04:14:07PM +0800, Andy Fan wrote:
I just got a rough idea of the patch; it looks like we have to narrow
down the cases where we can use this method. Consider the below example:
Hi
Not exactly narrow it down, but rather get rid of the wrong usage of skipping
for index scans. Since skipping for them was added later than for index
only scans, I can imagine there are still blind spots, so it's good that you've
looked. In this particular case, when the index expressions do not fully
cover the expressions the result needs to be distinct on, skipping just
doesn't have enough information and should not be used. I'll add it to
the next version, thanks!
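The verification promised here might conceptually look like this (a hypothetical Python sketch, not the patch's code): skipping is only safe when the index's key columns cover every expression the result must be distinct on.

```python
def skip_scan_is_safe(index_cols, distinct_cols):
    """Hypothetical planner-side check: a skip scan may only be used
    when the index columns cover every DISTINCT expression; otherwise
    rows differing only in uncovered columns (im100, im1000 in the
    example above) would be wrongly collapsed."""
    return set(distinct_cols) <= set(index_cols)

# index on j1(im5, i) does not cover DISTINCT * (all four columns):
print(skip_scan_is_safe({"im5", "i"}, {"i", "im5", "im100", "im1000"}))  # False
print(skip_scan_is_safe({"im5", "i"}, {"im5", "i"}))                     # True
```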
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
* Btree-implementation contains btree specific code to implement amskip,
introduced in the previous patch.
The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
But I wonder more generally what the idea here is. The following
comments that immediately follow provide some hints:
+ /*
+  * To check if we returned the same tuple, try to find a
+  * startItup on the current page. For that we need to update
+  * scankey to match the whole tuple and set nextkey to return
+  * an exact tuple, not the next one. If the nextOffset is the
+  * same as before, it means we are in the loop, return offnum
+  * to the original position and jump further
+  */
Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:
+ startOffset = _bt_binsrch(scan->indexRelation,
+                           so->skipScanKey,
+                           so->currPos.buf);
+
+ page = BufferGetPage(so->currPos.buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ if (nextOffset <= startOffset)
+ {
Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.
--
Peter Geoghegan
Hi Dmitry,
Also took another look at the patch now, and found a case of incorrect data. It looks related to the new way of creating the paths, as I can't recall seeing this in earlier versions.
create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);
postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
QUERY PLAN
-------------------------------------------------------------------------------
Sort (cost=2.92..2.94 rows=10 width=20)
Sort Key: a, b DESC, c
-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20)
Skip scan: true
(4 rows)
With 'order by a, b desc, c' we expect the value of column 'b' to always be 100. With enable_indexskipscan on, it always gives 1 though. It's not correct for the planner to choose a skip scan followed by a sort in this case.
-Floris
On Wed, Jul 08, 2020 at 03:44:26PM -0700, Peter Geoghegan wrote:
On Tue, Jun 9, 2020 at 3:20 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
* Btree-implementation contains btree specific code to implement amskip,
introduced in the previous patch.
The way that you're dealing with B-Tree tuples here needs to account
for posting list tuples:
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?
+ /*
+  * To check if we returned the same tuple, try to find a
+  * startItup on the current page. For that we need to update
+  * scankey to match the whole tuple and set nextkey to return
+  * an exact tuple, not the next one. If the nextOffset is the
+  * same as before, it means we are in the loop, return offnum
+  * to the original position and jump further
+  */
Why does it make sense to use the offset number like this? It isn't
stable or reliable. The patch goes on to do this:
+ startOffset = _bt_binsrch(scan->indexRelation,
+                           so->skipScanKey,
+                           so->currPos.buf);
+
+ page = BufferGetPage(so->currPos.buf);
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ if (nextOffset <= startOffset)
+ {
Why compare a heap TID's offset number (an offset number for a heap
page) to another offset number for a B-Tree leaf page? They're
fundamentally different things.
Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test whether they're
the same (in which case, if I'm not mistaken, we can probably compare item
pointers). I got this question when I was about to post a new version
with changes to address feedback from Andy; now I'll combine them and
send a cumulative patch.
On Fri, Jul 10, 2020 at 05:03:37PM +0000, Floris Van Nee wrote:
Also took another look at the patch now, and found a case of incorrect
data. It looks related to the new way of creating the paths, as I
can't recall seeing this in earlier versions.
create table t1 as select a,b,b%5 as c, random() as d from generate_series(1, 10) a, generate_series(1,100) b;
create index on t1 (a,b,c);
postgres=# explain select distinct on (a) * from t1 order by a,b desc,c;
QUERY PLAN
-------------------------------------------------------------------------------
Sort (cost=2.92..2.94 rows=10 width=20)
Sort Key: a, b DESC, c
-> Index Scan using t1_a_b_c_idx on t1 (cost=0.28..2.75 rows=10 width=20)
Skip scan: true
(4 rows)
Good point, thanks for looking at this. With the latest planner version
there are indeed more possibilities to use skipping. It never occurred to
me that some of those paths would still rely on the index scan returning the
full data set. I'll look into the details and add verification to prevent putting
something like this on top of a skip scan in the next version.
Good point, thanks for looking at this. With the latest planner version there
are indeed more possibilities to use skipping. It never occurred to me that
some of those paths would still rely on the index scan returning the full data set. I'll look
into the details and add verification to prevent putting something like this on top of
a skip scan in the next version.
I believe the required changes are something like in the attached patch. There were a few things I changed:
- build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).
- the uniquekeys that is built still contains some redundant keys that are normally eliminated from the path keys lists.
- the distinct_pathkeys may be NULL even though there's a possibility for skipping, but it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could, for example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; since a is constant, it's eliminated from the regular pathkeys.
- to combat the issues mentioned earlier, there's now a check in build_index_paths that checks whether the query_pathkeys match the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes; they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.
- creating the skip scan path did not actually fill the Path's unique keys. It should just set this to query_uniquekeys.
I've attached the first two unique-keys patches (v9: 0001, 0002), your patches rebased on v9 of unique keys (0003-0006), plus a diff patch (0007) that applies my suggested changes on top of them.
-Floris
Attachments:
* 0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
* 0002-Introduce-UniqueKey-attributes-on-RelOptInfo-struct.patch
* 0003-Extend-UniqueKeys.patch
* 0004-Index-skip-scan.patch
* 0005-Btree-implementation-of-skipping.patch
* 0006-Index-skip-scan-documentation.patch
* 0007-planner-fixes.patch
I've attached the first two unique-keys patches (v9: 0001, 0002), your
patches rebased on v9 of unique keys (0003-0006), plus a diff patch (0007)
that applies my suggested changes on top of them.
I just realized there's another thing that looks a bit strange too. From reading the thread, I thought it should be the case that in create_distinct_paths, it is checked whether the uniqueness in the unique_pathlist matches the uniqueness that is needed by the query.
This means that I think what it should be comparing is this:
- The generated index path should have some path-level unique keys set
- The path-level unique keys must be at least as strict as the query-level unique keys. E.g. if the path level is unique on (a), then the query level must be (a), or possibly (a,b).
I've changed the patch to compare the path-level keys (set in create index path) with the query-level keys in create_distinct_path. Currently, I don't think the previous implementation was an actual issue leading to incorrect queries, but it would cause problems if we tried to extend the uniqueness for distinct to join rels etc.
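That comparison can be sketched as follows (hypothetical helper name; the real check would live in query_has_uniquekeys_for): a path that is unique on a subset of the query's required columns gives a stronger guarantee, so it satisfies the query.

```python
def path_satisfies_query_uniqueness(path_unique_cols, query_unique_cols):
    """A path unique on (a) is also unique on (a, b): fewer columns in
    the path-level unique key means a stricter guarantee, so the path
    satisfies the query when its key is a subset of the query's key."""
    return set(path_unique_cols) <= set(query_unique_cols)

assert path_satisfies_query_uniqueness(["a"], ["a", "b"])      # stricter: ok
assert not path_satisfies_query_uniqueness(["a", "b"], ["a"])  # weaker: not ok
```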
One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?
-Floris
Attachments:
* 0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
* 0002-Introduce-UniqueKey-attributes-on-RelOptInfo-struct.patch
* 0003-Extend-UniqueKeys.patch
* 0004-Index-skip-scan.patch
* 0005-Btree-implementation-of-skipping.patch
* 0006-Index-skip-scan-documentation.patch
* 0007-planner-fixes.patch
On Sun, Jul 12, 2020 at 12:48:47PM +0000, Floris Van Nee wrote:
Good point, thanks for looking at this. With the latest planner version there
are indeed more possibilities to use skipping. It never occurred to me that
some of those paths would still rely on the index scan returning the full data set. I'll look
into the details and add verification to prevent putting something like this on top of
a skip scan in the next version.
I believe the required changes are something like in the attached patch. There were a few things I changed:
- build_uniquekeys was constructing the list incorrectly. For a DISTINCT a,b, it would create two unique keys, one with a and one with b. However, it should be one unique key with (a,b).
Yes, I've also noticed that while preparing the fix for an index scan not
covered by the index, and included it.
- the uniquekeys that is built, still contains some redundant keys, that are normally eliminated from the path keys lists.
I guess you're talking about:
+ if (EC_MUST_BE_REDUNDANT(ec))
+ continue;
Can you add some test cases to your changes to show the effect of it? It
seems to me redundant keys are already eliminated at this point, by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but it could
be I've missed something.
Along the lines I'm also curious about this part:
- ListCell *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
- EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
- exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
I'm curious about this myself, maybe someone can clarify. It looks like,
generally speaking, there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to
preserve in unique keys?
- the distinct_pathkeys may be NULL, even though there's a possibility for skipping. But it wouldn't create the uniquekeys in this case. This makes the planner not choose skip scans even though it could. For example in queries that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a is constant, it's eliminated from regular pathkeys.
What would be the point of skipping if it's a constant?
- to combat the issues mentioned earlier, there's now a check in build_index_paths that checks if the query_pathkeys matches the useful_pathkeys. Note that we have to use the path keys here rather than any of the unique keys. The unique keys are only Expr nodes - they do not contain the necessary information about ordering. Due to elimination of some constant path keys, we have to search the attributes of the index to find the correct prefix to use in skipping.
IIUC here you mean this function, right?
+ prefix = find_index_prefix_for_pathkey(root,
+ index,
+ BackwardScanDirection,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
Doesn't it duplicate the job already done in build_index_pathkeys by
building those pathkeys again? If yes, probably it's possible to reuse
useful_pathkeys. Not sure about unordered indexes, but looks like
query_pathkeys should also match in this case.
Will also look at the follow up questions in the next email.
- the uniquekeys that is built, still contains some redundant keys, that are
normally eliminated from the path keys lists.
I guess you're talking about:
+ if (EC_MUST_BE_REDUNDANT(ec))
+ continue;
Can you add some test cases to your changes to show the effect of it? It
seems to me redundant keys are already eliminated at this point by either
make_pathkeys_for_uniquekeys or even earlier for distinct on, but it could
be I've missed something.
The build_uniquekeys function calls make_pathkeys_for_uniquekeys, which checks for uniqueness using pathkey_is_unique, but not for constantness. Consider a query like:
SELECT DISTINCT ON (a,b) * FROM t1 WHERE a=1 ORDER BY a,b,c
All the regular path keys filter out 'a' for constantness here - so they would end up with a distinct pathkeys of (b) and sort path keys of (b,c).
The unique keys would end up with (a,b) though. In the previous patch it compared, in create_distinct_paths, the pathkeys to the unique keys, so it wouldn't consider the skip scan.
Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now even without the EC_MUST_BE_REDUNDANT check, so it's difficult to give an actual failing test case now. However, since all other code filters out constant keys, I think uniqueness should do it too; otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys rather than build_uniquekeys. It's probably good to move it there.
Along the lines I'm also curious about this part:
- ListCell *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
-     EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-     exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
I'm curious about this myself, maybe someone can clarify. It looks like,
generally speaking, there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to preserve it
in unique keys?
Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses.
That makes it difficult to check if we can perform the DISTINCT skip scan on table t2 as well (theoretically we could, but for that we need to check equivalence classes rather than expressions).
- the distinct_pathkeys may be NULL, even though there's a possibility for
skipping. But it wouldn't create the uniquekeys in this case. This makes the
planner not choose skip scans even though it could. For example in queries
that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
is constant, it's eliminated from regular pathkeys.
What would be the point of skipping if it's a constant?
For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan.
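The difference can be sketched with a toy model (Python standing in for the btree; hypothetical helper name): the skip scan locates the first tuple with the constant prefix and stops, where a plain scan would read every tuple sharing that prefix.

```python
from bisect import bisect_left

def distinct_on_first(index, a_const):
    """Sketch of DISTINCT ON (a) ... WHERE a = const under a skip scan:
    one binary-search probe finds the first tuple with the constant
    prefix; the scan then skips past the prefix and terminates."""
    pos = bisect_left(index, (a_const,))
    if pos < len(index) and index[pos][0] == a_const:
        return index[pos]
    return None

index = sorted((1, b) for b in range(1000)) + [(2, 0)]
print(distinct_on_first(index, 1))  # (1, 0): one probe instead of 1000 tuples
```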
- to combat the issues mentioned earlier, there's now a check in
build_index_paths that checks if the query_pathkeys matches the
useful_pathkeys. Note that we have to use the path keys here rather than
any of the unique keys. The unique keys are only Expr nodes - they do not
contain the necessary information about ordering. Due to elimination of
some constant path keys, we have to search the attributes of the index to
find the correct prefix to use in skipping.
IIUC here you mean this function, right?
+ prefix = find_index_prefix_for_pathkey(root,
+                                        index,
+                                        BackwardScanDirection,
+                                        llast_node(PathKey,
+                                                   root->distinct_pathkeys));
Doesn't it duplicate the job already done in build_index_pathkeys by building
those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys.
Not sure about unordered indexes, but looks like query_pathkeys should
also match in this case.
Yeah, there's definitely some double work there, but the actual impact may be limited - it doesn't actually allocate a new path key, but it looks it up in root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain PathKey the actual location in the index of that column. The constructed path keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But how to get from PathKey=b to that number 3, I didn't find a solid way except doing this. Maybe there is though?
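That prefix lookup could be sketched like this (a hypothetical helper in Python, standing in for the planner code): walk the physical index columns until the deduplicated pathkey columns are all covered, counting redundant repeats toward the prefix length.

```python
def skip_prefix_length(index_cols, pathkey_cols):
    """Shortest physical index prefix whose columns cover all the
    (deduplicated) pathkey columns; redundant repeats in the index
    still count toward the prefix length."""
    needed = set(pathkey_cols)
    seen = set()
    for n, col in enumerate(index_cols, start=1):
        seen.add(col)
        if needed <= seen:
            return n
    return None  # index cannot support skipping on these pathkeys

# index (a,a,b,a,b) with pathkeys (a,b): the skip prefix must be 3
print(skip_prefix_length(["a", "a", "b", "a", "b"], ["a", "b"]))  # 3
```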
-Floris
On Sat, Jul 11, 2020 at 9:10 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ nextOffset = ItemPointerGetOffsetNumber(&itup->t_tid);
Do you mean this last part with t_tid, which could also have a tid array
in case of posting tuple format?
Yeah. There is a TID array at the end of the index tuple when the tuple is a
posting list tuple (as indicated by BTreeTupleIsPosting()). It isn't
safe to assume that t_tid is a heap TID for this reason, even in code
that only ever considers data items (that is, non high-key tuples AKA
non-pivot tuples) on a leaf page. (Though new/incoming tuples cannot
be posting list tuples either, so you'll see the assumption that t_tid
is just a heap TID in parts of nbtinsert.c -- though only for the
new/incoming item.)
Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I've got this question when I was about to post a new version
with changes to address feedback from Andy, now I'll combine them and
send a cumulative patch.
This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recorded heap TIDs
locally and the time _bt_killitems() tried to match on those heap
TIDs to kill some index tuples via kill_prior_tuple -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.
Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)
Note that the rules surrounding buffer locks/pins for nbtree were
tightened up a tiny bit today -- see commit 4a70f829. Also, it's no
longer okay to use raw LockBuffer() calls in nbtree, so you're going
to have to fix that up when you next rebase -- you must use the new
_bt_lockbuf() wrapper function instead, so that the new Valgrind
instrumentation is used. This shouldn't be hard.
Perhaps you can use Valgrind to verify that this patch doesn't have
any unsafe buffer accesses. I recall problems like that in earlier
versions of the patch series. Valgrind should be able to detect most
bugs like that (though see comments within _bt_lockbuf() for details
of a bug in this area that Valgrind cannot detect even now).
--
Peter Geoghegan
On Tue, Jul 14, 2020 at 06:18:50PM +0000, Floris Van Nee wrote:
Due to the other changes I made in create_distinct_paths/query_has_uniquekeys_for, it will choose a correct plan now, even without the EC_MUST_BE_REDUNDANT check though, so it's difficult to give an actual failing test case now. However, since all code filters out constant keys, I think uniqueness should do it too - otherwise you could get into problems later on. It's also more consistent. If you already know something is unique by just (b), it doesn't make sense to store that it's unique by (a,b). Now that I think of it, the best place to do this EC_MUST_BE_REDUNDANT check is probably inside make_pathkeys_for_uniquekeys, rather than build_uniquekeys though. It's probably good to move it there.
That would be my suggestion as well.
Along the lines I'm also curious about this part:
- ListCell *k;
- List *exprs = NIL;
-
- foreach(k, ec->ec_members)
- {
-     EquivalenceMember *mem = (EquivalenceMember *) lfirst(k);
-     exprs = lappend(exprs, mem->em_expr);
- }
-
- result = lappend(result, makeUniqueKey(exprs, false, false));
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
I'm curious about this myself, maybe someone can clarify. It looks like
generally speaking there could be more than one member (if not
ec_has_volatile), "representing knowledge that multiple items are
effectively equal". Is this information not interesting enough to preserve it
in unique keys?
Yeah, that's a good question. Hence my question about the choice for Expr rather than EquivalenceClass for the Unique Keys patch to Andy/David. When storing just Expr, it is rather difficult to check equivalence in joins for example. Suppose, later on we decide to add join support to the distinct skip scan. Consider a query like this:
SELECT DISTINCT t1.a FROM t1 JOIN t2 ON t1.a=t2.a
As far as my understanding goes (I didn't look into it in detail though), I think here the distinct_pathkey will have an EqClass {t1.a, t2.a}. That results in a UniqueKey with expr (t1.a) (because currently we only take the first Expr in the list to construct the UniqueKey). We could also construct *two* unique keys, one with Expr (t1.a) and one with Expr (t2.a), but I don't think that's the correct approach either, as it will explode when you have multiple pathkeys, each having multiple Expr inside their EqClasses.
One UniqueKey can have multiple corresponding expressions, which also gives
us the possibility of having one unique key with (t1.a, t2.a), and it then
looks similar to an EquivalenceClass.
- the distinct_pathkeys may be NULL, even though there's a possibility for
skipping. But it wouldn't create the uniquekeys in this case. This makes the
planner not choose skip scans even though it could. For example in queries
that do SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b; Since a
is constant, it's eliminated from regular pathkeys.
What would be the point of skipping if it's a constant?
For the query: SELECT DISTINCT ON (a) * FROM t1 WHERE a=1 ORDER BY a,b;
There may be 1000s of records with a=1. We're only interested in the first one though. The traditional non-skip approach would still scan all records with a=1. Skip would just fetch the first one with a=1 and then skip to the next prefix and stop the scan.
The idea behind this query sounds questionable to me; it would be more
transparent to do this without distinct, since skipping will actually do exactly
the same thing, just under another name. But if allowing skipping on
constants does not bring significant changes in the code, it's probably
fine.
- to combat the issues mentioned earlier, there's now a check in
build_index_paths that checks if the query_pathkeys matches the
useful_pathkeys. Note that we have to use the path keys here rather than
any of the unique keys. The unique keys are only Expr nodes - they do not
contain the necessary information about ordering. Due to elimination of
some constant path keys, we have to search the attributes of the index to
find the correct prefix to use in skipping.
IIUC here you mean this function, right?
+ prefix = find_index_prefix_for_pathkey(root,
+                                        index,
+                                        BackwardScanDirection,
+                                        llast_node(PathKey,
+                                                   root->distinct_pathkeys));
Doesn't it duplicate the job already done in build_index_pathkeys by building
those pathkeys again? If yes, probably it's possible to reuse useful_pathkeys.
Not sure about unordered indexes, but looks like query_pathkeys should
also match in this case.
Yeah, there's definitely some double work there, but the actual impact may be limited - it doesn't actually allocate a new path key, but it looks it up in root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain PathKey the actual location in the index of that column. The constructed path keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But how to get from PathKey=b to that number 3, I didn't find a solid way except doing this. Maybe there is though?
I don't think there is a direct way, but why not modify
build_index_paths to also provide this information, or compare
index_pathkeys expressions with indextlist without actually creating
those pathkeys again?
And a couple of words about this thread [1]. It looks to me like a
strange way of interacting with the community. Are you going to
duplicate everything there, or what are your plans? At the very least
you could try to include everyone involved in the recipients list, not
exclude some of the authors.
[1]: /messages/by-id/e4b623692a1447d4a13ac80fa271c8e6@opammb0561.comp.optiver.com
One UniqueKey can have multiple corresponding expressions, which also gives us
the possibility of having one unique key with (t1.a, t2.a), and it now looks
similar to EquivalenceClass.
I believe the current definition of a unique key with two expressions (t1.a, t2.a) means that it's unique on the tuple (t1.a, t2.a) - this gives weaker guarantees than uniqueness on (t1.a) and uniqueness on (t2.a).
The idea behind this query sounds questionable to me; it would be more
transparent to do this without DISTINCT, since skipping will actually do
exactly the same stuff just under another name. But if allowing skipping on
constants does not bring significant changes to the code, it's probably fine.
Yeah indeed, I didn't say it's a query that people should generally write. :-) It's better to write as a regular SELECT with LIMIT 1 of course. However, it's more about being consistent and predictable to the user: if a SELECT DISTINCT ON (a) * FROM t1 runs fast, then it doesn't make sense to the user if a SELECT DISTINCT ON (a) * FROM t1 WHERE a=2 runs slow. And supporting it also makes the implementation more consistent, with few code changes.
Yeah, there's definitely some double work there, but the actual impact may
be limited - it doesn't actually allocate a new path key, but it looks it up in
root->canon_pathkeys and returns that path key.
I wrote it like this, because I couldn't find a way to identify from a certain
PathKey the actual location in the index of that column. The constructed path
keys list filters out all redundant path keys. An index on (a,a,b,a,b) becomes
path keys (a,b). Now if we skip on (a,b) we actually need to use prefix=3. But
how to get from PathKey=b to that number 3, I didn't find a solid way except
doing this. Maybe there is though?
I don't think there is a direct way, but why not modify build_index_paths to
also provide this information, or compare index_pathkeys expressions with
indextlist without actually creating those pathkeys again?
I agree there could be other ways - I don't currently have a strong preference for either. I can have a look at this later.
And a couple of words about this thread [1]. It looks to me like a strange way
of interacting with the community. Are you going to duplicate everything
there, or what are your plans? At the very least you could try to include
everyone involved in the recipients list, not exclude some of the authors.
When I wrote the first mail in the thread, I went to this thread [1] and included everyone from there, but I see now that I only included the to: and cc: people and forgot the original thread author, you. I'm sorry about that - I should've looked better to make sure I had everyone.
In any case, my plan is to keep the patch at least applicable to master, as I believe it can be helpful for discussions about both patches.
[1]: /messages/by-id/20200609102247.jdlatmfyeecg52fi@localhost
On Tue, Jul 21, 2020 at 04:35:55PM -0700, Peter Geoghegan wrote:
Well, it's obviously wrong, thanks for noticing. What is necessary is to
compare two index tuples, the start and the next one, to test if they're
the same (in which case if I'm not mistaken probably we can compare item
pointers). I've got this question when I was about to post a new version
with changes to address feedback from Andy, now I'll combine them and
send a cumulative patch.
This sounds like approximately the same problem as the one that
_bt_killitems() has to deal with as of Postgres 13. This is handled in
a way that is admittedly pretty tricky, even though the code does not
need to be 100% certain that it's "the same" tuple. Deduplication kind
of makes that a fuzzy concept. In principle there could be one big
index tuple instead of 5 tuples, even though the logical contents of
the page have not been changed between the time we recorded heap TIDs
locally and the time _bt_killitems() tried to match on those heap
TIDs to kill_prior_tuple-kill some index tuples -- a concurrent
deduplication pass could do that. Your code needs to be prepared for
stuff like that.
Ultimately posting list tuples are just a matter of understanding the
on-disk representation -- a "Small Matter of Programming". Even
without deduplication there are potential hazards from the physical
deletion of LP_DEAD-marked tuples in _bt_vacuum_one_page() (which is
not code that runs in VACUUM, despite the name). Make sure that you
hold a buffer pin on the leaf page throughout, because you need to do
that to make sure that VACUUM cannot concurrently recycle heap TIDs.
If VACUUM *is* able to concurrently recycle heap TIDs then it'll be
subtly broken. _bt_killitems() is safe because it either holds on to a
pin or gives up when the LSN changes at all. (ISTM that your only
choice is to hold on to a leaf page pin, since you cannot just decide
to give up in the way that _bt_killitems() sometimes can.)
I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <florisvannee@optiver.com> wrote:
One question about the unique keys - probably for Andy or David: I've looked in the archives to find arguments for/against using Expr nodes or EquivalenceClasses in the Unique Keys patch. However, I couldn't really find a clear answer about why the current patch uses Expr rather than EquivalenceClasses. At some point David mentioned "that probably Expr nodes were needed rather than EquivalenceClasses", but it's not really clear to me why. What were the thoughts behind this?
I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.
Thinking about it a bit harder, if we did use Exprs then it would mean
a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.
CREATE TABLE xy (x int primary key, y int not null);
SELECT DISTINCT y FROM xy WHERE x=y;
whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.
Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:
CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;
it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check. That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.
My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2]. After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}. I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL. This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.
So, it seems the patch dependency chain for skip scans just got a bit longer :-(
David
[1]: /messages/by-id/CAApHDvrEXcadNYAAdq6RO0eKZUG6rRHXJGAbpzj8y432gCD9bA@mail.gmail.com
[2]: /messages/by-id/20190920051857.2fhnvhvx4qdddviz@alap3.anarazel.de
Hi David:
Thanks for looking into this.
On Fri, Jul 31, 2020 at 11:07 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Mon, 13 Jul 2020 at 10:18, Floris Van Nee <florisvannee@optiver.com> wrote:
One question about the unique keys - probably for Andy or David: I've
looked in the archives to find arguments for/against using Expr nodes or
EquivalenceClasses in the Unique Keys patch. However, I couldn't really
find a clear answer about why the current patch uses Expr rather than
EquivalenceClasses. At some point David mentioned "that probably Expr nodes
were needed rather than EquivalenceClasses", but it's not really clear to
me why. What were the thoughts behind this?
I'm still not quite sure on this either way. I did think
EquivalenceClasses were more suitable before I wrote the POC patch for
unique keys. But after that, I had in mind that Exprs might be
better. The reason I thought this was due to the fact that the
DISTINCT clause list is a bunch of Exprs and if the UniqueKeys were
EquivalenceClasses then checking to see if the DISTINCT can be skipped
turned into something more complex that required looking through lists
of ec_members rather than just checking if the uniquekey exprs were a
subset of the DISTINCT clause.
Thinking about it a bit harder, if we did use Exprs then it would mean
a case like the following wouldn't work for Andy's DISTINCT no-op
stuff.
CREATE TABLE xy (x int primary key, y int not null);
SELECT DISTINCT y FROM xy WHERE x=y;
whereas if we use EquivalenceClasses then we'll find that we have an
EC with x,y in it and can skip the DISTINCT since we have a UniqueKey
containing that EquivalenceClass.
Also, looking at what Andy wrote to make a case like the following
work in his populate_baserel_uniquekeys() function in the 0002 patch:
CREATE TABLE ab (a int, b int, primary key(a,b));
SELECT DISTINCT a FROM ab WHERE b = 1;
it's a bit uninspiring. Really what we want here when checking if we
can skip doing the DISTINCT is a UniqueKey set using
EquivalenceClasses as we can just insist that any unmatched UniqueKey
items have an ec_is_const == true. However, that means we have to loop
through the ec_members of the EquivalenceClasses in the uniquekeys
during the DISTINCT check. That's particularly bad when you consider
that in a partitioned table case there might be an ec_member for each
child partition and there could be 1000s of child partitions and
following those ec_members chains is going to be too slow.
My current thoughts are that we should be using EquivalenceClasses but
we should first add some infrastructure to make them perform better.
My current thoughts are that we do something like what I mentioned in
[1] or something more like what Andres mentions in [2]. After that,
we could either make EquivalenceClass.ec_members a hash table or
binary search tree. Or even perhaps just have a single hash table/BST
for all EquivalenceClasses that allows very fast lookups from {Expr}
-> {EquivalenceClass}. I think an Expr can only belong in a single
non-merged EquivalenceClass. So when we do merging of
EquivalenceClasses we could just repoint that data structure to point
to the new EquivalenceClass. We'd never point to ones that have
ec_merged != NULL. This would also allow us to fix the poor
performance in regards to get_eclass_for_sort_expr() for partitioned
tables.
So, it seems the patch dependency chain for skip scans just got a bit
longer :-(
I admit that EquivalenceClasses have better expressive power. There are
two more cases we can handle better with EquivalenceClasses:
SELECT DISTINCT a, b, c FROM t WHERE a = b;
Currently the UniqueKey is (a, b, c), but it would be better as (a, c)
and (b, c). The other case happens similarly with GROUP BY.
After realizing this, I am still hesitant to do that, due to the
complexity. If we do that, we may have to maintain EquivalenceClasses in
one more place or make the existing EquivalenceClasses list longer, for
example: SELECT pk FROM t; The current infrastructure doesn't create any
EquivalenceClass for pk, so we would have to create a new one in this
case and reuse some existing ones in other cases. Finally, since
EquivalenceClasses are not so transparent to the upper user, we would
have to depend on the infrastructure change to look up an
EquivalenceClass quickly from an Expr.
I rethought the case you provided above. IIUC, there is such an issue
for joinrels; then we can just add an EC check in
populate_baserel_uniquekeys. As for the DISTINCT/GROUP BY case, we
should build the UniqueKeys from root->distinct_pathkeys and
root->group_pathkeys, where the EquivalenceClasses are already there.
I am still not insisting on either Expr or EquivalenceClasses right now;
if we need to change to EquivalenceClasses, I'd see whether we need to
take care of more places before doing that.
--
Best Regards
Andy Fan
On Mon, Jul 27, 2020 at 12:24:31PM +0200, Dmitry Dolgov wrote:
I see, thanks for clarification. You're right, in this part of
implementation there is no way to give up if LSN changes like
_bt_killitems does. As far as I can see the leaf page is already pinned
all the time between reading relevant tuples and comparing them, I only
need to handle posting list tuples.
Here is a new version that hopefully address most of the concerns
mentioned in this thread so far. As before, first two patches are taken
from UniqueKeys thread and attached only for the reference. List of
changes includes:
* fix for index scan not being fully covered
* rebase on the latest UniqueKey patch
* taking into account posting tuples (although I must say I couldn't
produce a test that will hit this part, so I would appreciate if
someone can take a look)
* fixes suggested by Floris with adjustments as discussed in the thread
There are no changes related to EquivalenceClasses vs expressions, which
would probably be my next target. Having this in mind I must admit I'm
not super excited about the possibility of including another patch as a
dependency without clear prospects and plans for it.
Thanks for the feedback folks!
Attachments:
v36-0001-Introduce-RelOptInfo-notnullattrs-attribute.patchtext/x-diff; charset=us-asciiDownload+53-1
v36-0002-Introduce-UniqueKey-attributes-on-RelOptInfo-str.patchtext/x-diff; charset=us-asciiDownload+1502-24
v36-0003-Extend-UniqueKeys.patchtext/x-diff; charset=us-asciiDownload+195-11
v36-0004-Index-skip-scan.patchtext/x-diff; charset=us-asciiDownload+546-18
v36-0005-Btree-implementation-of-skipping.patchtext/x-diff; charset=us-asciiDownload+1493-3
v36-0006-Index-skip-scan-documentation.patchtext/x-diff; charset=us-asciiDownload+100-1
On Sat, Aug 15, 2020 at 7:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
Here is a new version that hopefully address most of the concerns
mentioned in this thread so far. As before, first two patches are taken
from UniqueKeys thread and attached only for the reference. List of
changes includes:
Some thoughts on this version of the patch series (I'm focussing on
v36-0005-Btree-implementation-of-skipping.patch again):
* I see the following compiler warning:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Perhaps the warning is related to this nearby code that I noticed
Valgrind complains about:
==1083468== VALGRINDERROR-BEGIN
==1083468== Invalid read of size 4
==1083468== at 0x59568A: get_exprs_from_uniqueindex (uniquekeys.c:771)
==1083468== by 0x593C5B: populate_baserel_uniquekeys (uniquekeys.c:140)
==1083468== by 0x56AEA5: set_plain_rel_size (allpaths.c:586)
==1083468== by 0x56AADB: set_rel_size (allpaths.c:412)
==1083468== by 0x56A8CD: set_base_rel_sizes (allpaths.c:323)
==1083468== by 0x56A5A7: make_one_rel (allpaths.c:185)
==1083468== by 0x5AB426: query_planner (planmain.c:269)
==1083468== by 0x5AF02C: grouping_planner (planner.c:2058)
==1083468== by 0x5AD202: subquery_planner (planner.c:1015)
==1083468== by 0x5ABABF: standard_planner (planner.c:405)
==1083468== by 0x5AB7F8: planner (planner.c:275)
==1083468== by 0x6E6F84: pg_plan_query (postgres.c:875)
==1083468== by 0x6E70C4: pg_plan_queries (postgres.c:966)
==1083468== by 0x6E7497: exec_simple_query (postgres.c:1158)
==1083468== by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468== by 0x624284: BackendRun (postmaster.c:4541)
==1083468== by 0x623995: BackendStartup (postmaster.c:4225)
==1083468== by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468== by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468== by 0x514AF2: main (main.c:209)
==1083468== Address 0x75f13e0 is 4,448 bytes inside a block of size
8,192 alloc'd
==1083468== at 0x483B7F3: malloc (in
/usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==1083468== by 0x8C15C8: AllocSetAlloc (aset.c:919)
==1083468== by 0x8CEA52: palloc (mcxt.c:964)
==1083468== by 0x267F25: systable_beginscan (genam.c:373)
==1083468== by 0x8682CE: SearchCatCacheMiss (catcache.c:1359)
==1083468== by 0x868167: SearchCatCacheInternal (catcache.c:1299)
==1083468== by 0x867E2C: SearchCatCache1 (catcache.c:1167)
==1083468== by 0x8860B2: SearchSysCache1 (syscache.c:1123)
==1083468== by 0x8BD482: check_enable_rls (rls.c:66)
==1083468== by 0x68A113: get_row_security_policies (rowsecurity.c:134)
==1083468== by 0x683C2C: fireRIRrules (rewriteHandler.c:2045)
==1083468== by 0x687340: QueryRewrite (rewriteHandler.c:3962)
==1083468== by 0x6E6EB1: pg_rewrite_query (postgres.c:784)
==1083468== by 0x6E6D23: pg_analyze_and_rewrite (postgres.c:700)
==1083468== by 0x6E7476: exec_simple_query (postgres.c:1155)
==1083468== by 0x6EBCD3: PostgresMain (postgres.c:4309)
==1083468== by 0x624284: BackendRun (postmaster.c:4541)
==1083468== by 0x623995: BackendStartup (postmaster.c:4225)
==1083468== by 0x61FB70: ServerLoop (postmaster.c:1742)
==1083468== by 0x61F309: PostmasterMain (postmaster.c:1415)
==1083468==
==1083468== VALGRINDERROR-END
(You'll see the same error if you run Postgres Valgrind + "make
installcheck", though I don't think that the queries in question are
tests that you yourself wrote.)
* IndexScanDescData.xs_itup comments could stand to be updated here --
IndexScanDescData.xs_want_itup is no longer just about index-only
scans.
* Do we really need the AM-level boolean flag/argument named
"scanstart"? Why not just follow the example of btgettuple(), which
determines whether or not the scan has been initialized based on the
current scan position?
Just because you set so->currPos.buf to InvalidBuffer doesn't mean you
cannot or should not take the same approach as btgettuple(). And even
if you can't take exactly the same approach, I would still think that
the scan's opaque B-Tree state should remember if it's the first call
to _bt_skip() (rather than some subsequent call) in some other way
(e.g. carrying a "scanstart" bool flag directly).
A part of my objection to "scanstart" is that it seems to require that
much of the code within _bt_skip() get another level of
indentation...which makes it even more difficult to follow.
* I don't understand what _bt_scankey_within_page() comments mean when
they refer to "the page highkey". It looks like this function examines
the highest data item on the page, not the high key.
It is highly confusing to refer to a tuple as the page high key if it
isn't the tuple from the P_HIKEY offset number on a non-rightmost
page, which is a pivot tuple even on the leaf level (as indicated by
BTreeTupleIsPivot()).
* Why does _bt_scankey_within_page() have an unused "ScanDirection
dir" argument?
* Why is it okay to do anything important based on the
_bt_scankey_within_page() return value?
If the page is empty, then how can we know that it's okay to go to the
next value? I'm concerned that there could be subtle bugs in this
area. VACUUM will usually just delete the empty page. But it won't
always do so, for a variety of reasons that aren't worth going into
now. This could mask bugs in this area. I'm concerned about patterns
like this one from _bt_skip():
while (!nextFound)
{
....
if (_bt_scankey_within_page(scan, so->skipScanKey,
so->currPos.buf, dir))
{
...
}
else
/*
* If startItup could be not found within the current page,
* assume we found something new
*/
nextFound = true;
....
}
Why would you assume that "we found something new" here? In general I
just don't understand the design of _bt_skip(). I get the basic idea
of what you're trying to do, but it could really use better comments.
* The "jump one more time if it's the same as at the beginning" thing
seems scary to me. Maybe you should be doing something with the actual
high key here.
* Tip: You might find cases involving "empty but not yet deleted"
pages a bit easier to test by temporarily disabling page deletion. You
can modify nbtree.c to look like this:
index a1ad22f785..db977a0300 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -1416,6 +1416,7 @@ backtrack:
Assert(!attempt_pagedel || nhtidslive == 0);
}
+ attempt_pagedel = false;
if (attempt_pagedel)
{
MemoryContext oldcontext;
That's all I have for now.
--
Peter Geoghegan