Generic Index Skip Scan

Started by Floris Van Nee over 5 years ago, 3 messages
#1 Floris Van Nee
florisvannee@Optiver.com
5 attachment(s)

Besides the great efforts that Dmitry et al. are putting into the skip scan for DISTINCT queries [1], I'm also still keen on extending its use further. I'd like to address here the limited cases in which skipping can currently occur. A few months ago I shared an initial rough patch that provided a generic skip implementation, but lacked the proper planning work [2]. I'd now like to share a second patch set that provides an implementation of the planner side as well. Perhaps this can lead to some proper discussion about how we'd like to shape this patch further.

Please see [2] for an introduction and some rough performance comparisons. This patch set improves upon those, because it implements proper cost estimation logic. It will now only choose the skip scan if it's deemed to be cheaper than using a regular index scan. Other than that, all the features are still there. The skip scan can be used in many more types of queries than in the original DISTINCT patch as provided in [1], making it more performant and also more predictable for users.
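To give a quick feel for the kind of query this targets (a hypothetical example, not taken from the patch's regression tests): given an index on t(a, b), a query like

    SELECT * FROM t WHERE b = 2;

can be answered by skipping from one distinct value of the leading column a to the next and searching for b = 2 within each group, instead of scanning the entire index.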

I'm keen on receiving feedback on this idea and on the patch. I believe it could be a great feature that is useful to many users. However, when I posted the previous version of the patch, only Thomas expressed his explicit interest in the feature. It would be useful for me to know if there's enough interest here. Please speak out as well if you can't (currently) review, but do think that this feature is worth the effort.

I'm sure there are still plenty of things that need to be improved. I have some in mind, but at the moment it's hard for me to judge which ones are really important and which ones are not. I think I really need someone with more experience in this area of the code to look at it and provide feedback.

v9-0001 + v9-0002 are Andy's UniqueKeys patches [3]
v01-0001 is a slightly modified version of Dmitry's extension of the UniqueKeys patch (his latest patch plus the diff patch that I posted in the original index skip thread)
v01-0002 is the bulk of the work: the skip implementation, indexam interface and implementation for DISTINCT queries
v01-0003 is the additional planner work to add support for skipping in regular index scans (non-DISTINCT)

-Floris

[1]: /messages/by-id/20200609102247.jdlatmfyeecg52fi@localhost
[2]: /messages/by-id/c5c5c974714a47f1b226c857699e8680@opammb0561.comp.optiver.com
[3]: /messages/by-id/CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com

Attachments:

v9-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch (application/octet-stream)
From 72b4401e8f7b427b22578de90e63321d0ba29c36 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Sun, 3 May 2020 22:37:46 +0800
Subject: [PATCH 1/5] Introduce RelOptInfo->notnullattrs attribute

The notnullattrs is calculated from the catalog and the run-time query. That
information is translated to the child relations as well for partitioned
tables.
---
 src/backend/optimizer/path/allpaths.c  | 31 ++++++++++++++++++++++++++
 src/backend/optimizer/plan/initsplan.c | 10 +++++++++
 src/backend/optimizer/util/plancat.c   | 10 +++++++++
 src/include/nodes/pathnodes.h          |  2 ++
 4 files changed, 53 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..484dab0a1a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1005,6 +1005,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		RelOptInfo *childrel;
 		ListCell   *parentvars;
 		ListCell   *childvars;
+		int i = -1;
 
 		/* append_rel_list contains all append rels; ignore others */
 		if (appinfo->parent_relid != parentRTindex)
@@ -1061,6 +1062,36 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 								   (Node *) rel->reltarget->exprs,
 								   1, &appinfo);
 
+		/* Copy notnullattrs. */
+		while ((i = bms_next_member(rel->notnullattrs, i)) > 0)
+		{
+			AttrNumber attno = i + FirstLowInvalidHeapAttributeNumber;
+			AttrNumber child_attno;
+			if (attno == 0)
+			{
+				/* Whole row is not null, so must be same for child */
+				childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+														attno - FirstLowInvalidHeapAttributeNumber);
+				break;
+			}
+		if (attno < 0)
+				/* no need to translate system column */
+				child_attno = attno;
+			else
+			{
+				Node * node = list_nth(appinfo->translated_vars, attno - 1);
+				if (!IsA(node, Var))
+					/* This may happen in the UNION case, like (SELECT a FROM t1 UNION SELECT a + 3
+					 * FROM t2) t, where we know t.a is not null
+					 */
+					continue;
+				child_attno = castNode(Var, node)->varattno;
+			}
+
+			childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+													child_attno - FirstLowInvalidHeapAttributeNumber);
+		}
+
 		/*
 		 * We have to make child entries in the EquivalenceClass data
 		 * structures as well.  This is needed either if the parent
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index e978b491f6..95b1b14cd3 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -830,6 +830,16 @@ deconstruct_recurse(PlannerInfo *root, Node *jtnode, bool below_outer_join,
 		{
 			Node	   *qual = (Node *) lfirst(l);
 
+			/* Set the not null info now */
+			ListCell	*lc;
+			List		*non_nullable_vars = find_nonnullable_vars(qual);
+			foreach(lc, non_nullable_vars)
+			{
+				Var *var = lfirst_node(Var, lc);
+				RelOptInfo *rel = root->simple_rel_array[var->varno];
+				rel->notnullattrs = bms_add_member(rel->notnullattrs,
+												   var->varattno - FirstLowInvalidHeapAttributeNumber);
+			}
 			distribute_qual_to_rels(root, qual,
 									false, below_outer_join, JOIN_INNER,
 									root->qual_security_level,
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 25545029d7..0b2f9d398a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -117,6 +117,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	Relation	relation;
 	bool		hasindex;
 	List	   *indexinfos = NIL;
+	int			i;
 
 	/*
 	 * We need not lock the relation since it was already locked, either by
@@ -463,6 +464,15 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		set_relation_partition_info(root, rel, relation);
 
+	Assert(rel->notnullattrs == NULL);
+	for(i = 0; i < relation->rd_att->natts; i++)
+	{
+		FormData_pg_attribute attr = relation->rd_att->attrs[i];
+		if (attr.attnotnull)
+			rel->notnullattrs = bms_add_member(rel->notnullattrs,
+											   attr.attnum - FirstLowInvalidHeapAttributeNumber);
+	}
+
 	table_close(relation, NoLock);
 
 	/*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 485d1b06c9..9e3ebd488a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -709,6 +709,8 @@ typedef struct RelOptInfo
 	PlannerInfo *subroot;		/* if subquery */
 	List	   *subplan_params; /* if subquery */
 	int			rel_parallel_workers;	/* wanted number of parallel workers */
+	/* Not-null attrs, offset by -FirstLowInvalidHeapAttributeNumber */
+	Bitmapset		*notnullattrs;
 
 	/* Information about foreign tables and foreign joins */
 	Oid			serverid;		/* identifies server for the table or join */
-- 
2.27.0

v9-0002-Introduce-UniqueKey-attributes-on-RelOptInfo-stru.patch (application/octet-stream)
From 4c52037374e00a1e71c02004ef01337be6008dd0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Mon, 11 May 2020 15:50:52 +0800
Subject: [PATCH 2/5] Introduce UniqueKey attributes on RelOptInfo struct.

UniqueKey is a set of exprs on a RelOptInfo which represents the exprs
that will be unique on the given RelOptInfo. You can see README.uniquekey
for more information.
---
 src/backend/nodes/copyfuncs.c               |   13 +
 src/backend/nodes/list.c                    |   31 +
 src/backend/nodes/makefuncs.c               |   13 +
 src/backend/nodes/outfuncs.c                |   11 +
 src/backend/nodes/readfuncs.c               |   10 +
 src/backend/optimizer/path/Makefile         |    3 +-
 src/backend/optimizer/path/README.uniquekey |  131 +++
 src/backend/optimizer/path/allpaths.c       |   10 +
 src/backend/optimizer/path/joinpath.c       |    9 +-
 src/backend/optimizer/path/joinrels.c       |    2 +
 src/backend/optimizer/path/pathkeys.c       |    3 +-
 src/backend/optimizer/path/uniquekeys.c     | 1131 +++++++++++++++++++
 src/backend/optimizer/plan/planner.c        |   13 +-
 src/backend/optimizer/prep/prepunion.c      |    2 +
 src/backend/optimizer/util/appendinfo.c     |   44 +
 src/backend/optimizer/util/inherit.c        |   16 +-
 src/include/nodes/makefuncs.h               |    3 +
 src/include/nodes/nodes.h                   |    1 +
 src/include/nodes/pathnodes.h               |   29 +-
 src/include/nodes/pg_list.h                 |    2 +
 src/include/optimizer/appendinfo.h          |    3 +
 src/include/optimizer/optimizer.h           |    2 +
 src/include/optimizer/paths.h               |   43 +
 23 files changed, 1502 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/README.uniquekey
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de66..1f50400fd2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2273,6 +2273,16 @@ _copyPathKey(const PathKey *from)
 	return newnode;
 }
 
+static UniqueKey *
+_copyUniqueKey(const UniqueKey *from)
+{
+	UniqueKey	*newnode = makeNode(UniqueKey);
+
+	COPY_NODE_FIELD(exprs);
+	COPY_SCALAR_FIELD(multi_nullvals);
+
+	return newnode;
+}
 /*
  * _copyRestrictInfo
  */
@@ -5152,6 +5162,9 @@ copyObjectImpl(const void *from)
 		case T_PathKey:
 			retval = _copyPathKey(from);
 			break;
+		case T_UniqueKey:
+			retval = _copyUniqueKey(from);
+			break;
 		case T_RestrictInfo:
 			retval = _copyRestrictInfo(from);
 			break;
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 80fa8c84e4..a7a99b70f2 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -687,6 +687,37 @@ list_member_oid(const List *list, Oid datum)
 	return false;
 }
 
+/*
+ * return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+	const ListCell	*lc1, *lc2;
+
+	Assert(IsPointerList(members));
+	Assert(IsPointerList(target));
+	check_list_invariants(members);
+	check_list_invariants(target);
+
+	foreach(lc1, members)
+	{
+		bool found = false;
+		foreach(lc2, target)
+		{
+			if (equal(lfirst(lc1), lfirst(lc2)))
+			{
+				found = true;
+				break;
+			}
+		}
+		if (!found)
+			return false;
+	}
+	return true;
+}
+
 /*
  * Delete the n'th cell (counting from 0) in list.
  *
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 49de285f01..646cf7c9a1 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -814,3 +814,16 @@ makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
 	v->va_cols = va_cols;
 	return v;
 }
+
+
+/*
+ * makeUniqueKey
+ */
+UniqueKey*
+makeUniqueKey(List *exprs, bool multi_nullvals)
+{
+	UniqueKey * ukey = makeNode(UniqueKey);
+	ukey->exprs = exprs;
+	ukey->multi_nullvals = multi_nullvals;
+	return ukey;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..c3a9632992 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2428,6 +2428,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+	WRITE_NODE_FIELD(exprs);
+	WRITE_BOOL_FIELD(multi_nullvals);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4127,6 +4135,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 				_outPathKey(str, obj);
 				break;
+			case T_UniqueKey:
+				_outUniqueKey(str, obj);
+				break;
 			case T_PathTarget:
 				_outPathTarget(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..3a18571d0c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -452,6 +452,14 @@ _readSetOperationStmt(void)
 	READ_DONE();
 }
 
+static UniqueKey *
+_readUniqueKey(void)
+{
+	READ_LOCALS(UniqueKey);
+	READ_NODE_FIELD(exprs);
+	READ_BOOL_FIELD(multi_nullvals);
+	READ_DONE();
+}
 
 /*
  *	Stuff from primnodes.h.
@@ -2656,6 +2664,8 @@ parseNodeString(void)
 		return_value = _readCommonTableExpr();
 	else if (MATCH("SETOPERATIONSTMT", 16))
 		return_value = _readSetOperationStmt();
+	else if (MATCH("UNIQUEKEY", 9))
+		return_value = _readUniqueKey();
 	else if (MATCH("ALIAS", 5))
 		return_value = _readAlias();
 	else if (MATCH("RANGEVAR", 8))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekeys.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 0000000000..5eac761995
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,131 @@
+1. What is UniqueKey?
+We can think of a UniqueKey as a set of exprs for a RelOptInfo which we are sure
+doesn't yield the same result among all the rows. The simplest example of a
+UniqueKey is a primary key.
+
+However, we define the UniqueKey as below.
+
+typedef struct UniqueKey
+{
+        NodeTag	type;
+        List	*exprs;
+        bool	multi_nullvals;
+} UniqueKey;
+
+exprs is a list of exprs which are unique on the current RelOptInfo. exprs = NIL
+is a special case of UniqueKey, which means there is only one row in that
+relation. It has stronger semantics than the others: in SELECT uk FROM t, uk is a
+normal unique key and may have different values, while in SELECT colx FROM t
+WHERE uk = const, colx is unique AND we have only 1 value. This field can be used
+for innerrel_is_unique. This logic is handled specially in the
+add_uniquekey_for_onerow function.
+
+multi_nullvals: true means multiple null values may exist in these exprs, so
+uniqueness is not guaranteed in this case. This field is necessary for
+remove_useless_joins & reduce_unique_semijoins, where we don't mind these
+duplicated NULL values. It is set to true in 2 cases. One is a unique key
+from a unique index where the related column is nullable. The other one is for
+outer joins. See populate_joinrel_uniquekeys for details.
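+
+For example (a hypothetical illustration, not taken from the patch's tests):
+
+CREATE TABLE t (a int UNIQUE);          -- a is nullable
+INSERT INTO t VALUES (NULL), (NULL);
+SELECT DISTINCT a FROM t;               -- must still deduplicate the two NULLs
+
+Here the UniqueKey (a) has multi_nullvals = true, so it can't be used to elide
+the DISTINCT. It is still fine for join removal, because NULLs never match a
+join clause and therefore can't duplicate rows.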
+
+
+The UniqueKey can be used in at least the following cases:
+1. remove_useless_joins.
+2. reduce_unique_semijoins.
+3. remove the Distinct node if the distinct clause is unique (see the example
+   below).
+4. remove the Agg node if the group by clause is unique.
+5. Index Skip Scan (WIP)
+6. Aggregation Push Down without 2-phase aggregation if the join can't
+   duplicate the aggregated rows. (work-in-progress feature)
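+
+As a hypothetical example of case 3: with a primary key on t(pk),
+
+SELECT DISTINCT pk FROM t;
+
+(pk) is a UniqueKey of t, so the rows are provably distinct already and the
+planner can skip the deduplication step.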
+
+2. How is it maintained?
+
+We have a set of populate_xxx_uniquekeys functions to maintain the uniquekeys in
+various cases. xxx includes baserel, joinrel, partitionedrel, distinctrel,
+groupedrel, and unionrel. We also need to convert the uniquekeys from a subquery
+to the outer relation, which is what convert_subquery_uniquekeys does.
+
+1. The first part is about baserels. We handle 3 cases. Suppose we have a unique
+index on (a, b).
+
+1. SELECT a, b FROM t.  UniqueKey (a, b)
+2. SELECT a FROM t WHERE b = 1;  UniqueKey (a)
+3. SELECT .. FROM t WHERE a = 1 AND b = 1;  UniqueKey (NIL).  onerow case, every
+   column is unique.
+
+2. The next part is the joinrel. This part is the most error-prone; we simplified
+the rules as below (see the example after this list):
+1. If the relation's UniqueKey can't be duplicated after the join, then it will
+   still be valid for the join rel. The function we use here is
+   innerrel_keeps_unique. The basic idea is innerrel.any_col = outer.uk.
+
+2. If the UniqueKey can't stay valid via rule 1, the combination of the
+   UniqueKeys from both sides is valid for sure.  We can prove this as: if the
+   unique exprs from rel1 are duplicated by rel2, the duplicated rows must
+   contain different unique exprs from rel2.
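+
+A hypothetical example of rule 1: with a primary key on t1(pk),
+
+SELECT t2.uk FROM t2 JOIN t1 ON t2.x = t1.pk;
+
+each t2 row matches at most one t1 row, so no t2 row can be duplicated by the
+join, and any UniqueKey of t2 (here uk) remains valid for the join relation.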
+
+More considerations about onerow:
+1. If a relation has one row and it can't be duplicated, it may still contain
+   multi_nullvals after an outer join.
+2. If either UniqueKey can be duplicated after the join, we can get one row
+   only when both sides are one row AND there is no outer join.
+3. Whenever the onerow UniqueKey is not valid any more, we need to convert the
+   one-row UniqueKey to a normal unique key since we don't store exprs for a
+   one-row relation. get_exprs_from_uniquekey will be used here.
+
+
+More considerations about multi_nullvals after a join:
+1. If the original UniqueKey has multi_nullvals, the final UniqueKey will have
+   multi_nullvals in any case.
+2. If a unique key doesn't allow multi_nullvals, it may start allowing them
+   after an outer join.
+
+
+3. When we come to subqueries, we need convert_subquery_uniquekeys, just like
+convert_subquery_pathkeys.  Only a UniqueKey inside the subquery which is
+referenced as a Var in the outer relation will be reused. The relationship between
+the outerrel.Var and subquery.exprs is built with outerrel->subroot->processed_tlist.
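+
+A hypothetical example: in
+
+SELECT sub.pk FROM (SELECT pk FROM t) sub;
+
+the subquery's UniqueKey on t.pk is referenced as sub.pk in the outer query, so
+it can be converted and reused on the outer relation.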
+
+
+4. As for SRF functions, an SRF breaks the uniqueness of a uniquekey. However it
+is handled in adjust_paths_for_srfs, which happens after the query_planner, so
+we maintain the UniqueKey until there and reset it to NIL at that
+place. This can't help the distinct/group by elimination cases but probably helps
+in some other cases, like reduce_unique_semijoins/remove_useless_joins, and it is
+semantically correct.
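+
+For example (hypothetical): in
+
+SELECT pk, generate_series(1, 2) FROM t;
+
+each row of t is emitted twice, so pk is no longer unique in the result and the
+UniqueKey has to be dropped once the SRF is applied.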
+
+
+5. As for inheritance tables, we first maintain the UniqueKey on the childrels as
+well. But for a partitioned table we need to maintain 2 different kinds of
+UniqueKey: 1). UniqueKey on the parent relation 2). UniqueKey on the child
+relations for partition-wise queries.
+
+Example:
+CREATE TABLE p (a int not null, b int not null) partition by list (a);
+CREATE TABLE p0 partition of p for values in (1);
+CREATE TABLE p1 partition of p for values in (2);
+
+create unique index p0_b on p0(b);
+create unique index p1_b on p1(b);
+
+Now b is only unique at the partition level, so the DISTINCT can't be removed in
+the following case: SELECT DISTINCT b FROM p;
+
+Another example is SELECT DISTINCT a, b FROM p WHERE a = 1; Since only one
+partition is chosen, the UniqueKey on the child relation is the same as the
+UniqueKey on the parent relation.
+
+Another usage of UniqueKey at the partition level is that it can be helpful for
+partition-wise joins.
+
+As for the UniqueKey at the parent table level, it comes about in 2 different ways:
+1). the UniqueKey is derived from a unique index, but the index must be the same
+in all the related child relations and the unique index must contain the
+partition key. Example:
+
+CREATE UNIQUE INDEX p_ab ON p(a, b);  -- where a is the partition key.
+
+-- Query
+SELECT a, b FROM p;  -- (a, b) is a UniqueKey of p.
+
+2). If the parent relation has only one childrel, the UniqueKey on the childrel
+is the UniqueKey on the parent as well.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 484dab0a1a..2ad9d06d7a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -579,6 +579,12 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	 */
 	check_index_predicates(root, rel);
 
+	/*
+	 * Now that we've marked which partial indexes are suitable, we can
+	 * build the relation's unique keys.
+	 */
+	populate_baserel_uniquekeys(root, rel, rel->indexlist);
+
 	/* Mark rel with estimated output rows, width, etc */
 	set_baserel_size_estimates(root, rel);
 }
@@ -1310,6 +1316,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 	/* Add paths to the append relation. */
 	add_paths_to_append_rel(root, rel, live_childrels);
+	if (IS_PARTITIONED_REL(rel))
+		populate_partitionedrel_uniquekeys(root, rel, live_childrels);
 }
 
 
@@ -2383,6 +2391,8 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 										  pathkeys, required_outer));
 	}
 
+	convert_subquery_uniquekeys(root, rel, sub_final_rel);
+
 	/* If outer rel allows parallelism, do same for partial paths. */
 	if (rel->consider_parallel && bms_is_empty(required_outer))
 	{
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index db54a6ba2e..ef0fd2fb0b 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -71,13 +71,6 @@ static void consider_parallel_mergejoin(PlannerInfo *root,
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 								 RelOptInfo *outerrel, RelOptInfo *innerrel,
 								 JoinType jointype, JoinPathExtraData *extra);
-static List *select_mergejoin_clauses(PlannerInfo *root,
-									  RelOptInfo *joinrel,
-									  RelOptInfo *outerrel,
-									  RelOptInfo *innerrel,
-									  List *restrictlist,
-									  JoinType jointype,
-									  bool *mergejoin_allowed);
 static void generate_mergejoin_paths(PlannerInfo *root,
 									 RelOptInfo *joinrel,
 									 RelOptInfo *innerrel,
@@ -1927,7 +1920,7 @@ hash_inner_and_outer(PlannerInfo *root,
  * if it is mergejoinable and involves vars from the two sub-relations
  * currently of interest.
  */
-static List *
+List *
 select_mergejoin_clauses(PlannerInfo *root,
 						 RelOptInfo *joinrel,
 						 RelOptInfo *outerrel,
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2d343cd293..b9163ee8ff 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -924,6 +924,8 @@ populate_joinrel_with_paths(PlannerInfo *root, RelOptInfo *rel1,
 
 	/* Apply partitionwise join technique, if possible. */
 	try_partitionwise_join(root, rel1, rel2, joinrel, sjinfo, restrictlist);
+
+	populate_joinrel_uniquekeys(root, joinrel, rel1, rel2, restrictlist, sjinfo->jointype);
 }
 
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ce9bf87e9b..7e596d4194 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -33,7 +33,6 @@ static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 											 RelOptInfo *partrel,
 											 int partkeycol);
-static Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
 static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
 
 
@@ -1035,7 +1034,7 @@ convert_subquery_pathkeys(PlannerInfo *root, RelOptInfo *rel,
  * We need this to ensure that we don't return pathkeys describing values
  * that are unavailable above the level of the subquery scan.
  */
-static Var *
+Var *
 find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle)
 {
 	ListCell   *lc;
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
new file mode 100644
index 0000000000..b33bcd2f32
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -0,0 +1,1131 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekeys.c
+ *	  Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/uniquekeys.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/appendinfo.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/tlist.h"
+#include "rewrite/rewriteManip.h"
+
+
+/*
+ * This struct is used to help populate_joinrel_uniquekeys.
+ *
+ * added_to_joinrel is true if a uniquekey (from outerrel or innerrel)
+ * has been added to joinrel.
+ * useful is true if the exprs of the uniquekey still appear in joinrel.
+ */
+typedef struct UniqueKeyContextData
+{
+	UniqueKey	*uniquekey;
+	bool	added_to_joinrel;
+	bool	useful;
+} *UniqueKeyContext;
+
+static List *initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel);
+static bool innerrel_keeps_unique(PlannerInfo *root,
+								  RelOptInfo *outerrel,
+								  RelOptInfo *innerrel,
+								  List *restrictlist,
+								  bool reverse);
+
+static List *get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+										List *const_exprs,
+										List *const_expr_opfamilies,
+										Bitmapset *used_varattrs,
+										bool *useful,
+										bool *multi_nullvals);
+static List *get_exprs_from_uniquekey(RelOptInfo *joinrel,
+									  RelOptInfo *rel1,
+									  UniqueKey *ukey);
+static void add_uniquekey_for_onerow(RelOptInfo *rel);
+static bool add_combined_uniquekey(RelOptInfo *joinrel,
+								   RelOptInfo *outer_rel,
+								   RelOptInfo *inner_rel,
+								   UniqueKey *outer_ukey,
+								   UniqueKey *inner_ukey,
+								   JoinType jointype);
+
+/* Used for unique indexes checking for partitioned table */
+static bool index_constains_partkey(RelOptInfo *partrel,  IndexOptInfo *ind);
+static IndexOptInfo *simple_copy_indexinfo_to_parent(PlannerInfo *root,
+													 RelOptInfo *parentrel,
+													 IndexOptInfo *from);
+static bool simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2);
+static void adjust_partition_unique_indexlist(PlannerInfo *root,
+											  RelOptInfo *parentrel,
+											  RelOptInfo *childrel,
+											  List **global_unique_index);
+
+/* Helper function for grouped relation and distinct relation. */
+static void add_uniquekey_from_sortgroups(PlannerInfo *root,
+										  RelOptInfo *rel,
+										  List *sortgroups);
+
+/*
+ * populate_baserel_uniquekeys
+ *		Populate the 'baserel' uniquekeys list by looking at the rel's unique
+ *		indexes and baserestrictinfo.
+ */
+void
+populate_baserel_uniquekeys(PlannerInfo *root,
+							RelOptInfo *baserel,
+							List *indexlist)
+{
+	ListCell *lc;
+	List	*matched_uniq_indexes = NIL;
+
+	/* Attrs appears in rel->reltarget->exprs. */
+	Bitmapset *used_attrs = NULL;
+
+	List	*const_exprs = NIL;
+	List	*expr_opfamilies = NIL;
+
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	foreach(lc, indexlist)
+	{
+		IndexOptInfo *ind = (IndexOptInfo *) lfirst(lc);
+		if (!ind->unique || !ind->immediate ||
+			(ind->indpred != NIL && !ind->predOK))
+			continue;
+		matched_uniq_indexes = lappend(matched_uniq_indexes, ind);
+	}
+
+	if (matched_uniq_indexes == NIL)
+		return;
+
+	/* Check which attrs are used in baserel->reltarget */
+	pull_varattnos((Node *)baserel->reltarget->exprs, baserel->relid, &used_attrs);
+
+	/* Check which attnos are used in mergeable const filters */
+	foreach(lc, baserel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		if (rinfo->mergeopfamilies == NIL)
+			continue;
+
+		if (bms_is_empty(rinfo->left_relids))
+		{
+			const_exprs = lappend(const_exprs, get_rightop(rinfo->clause));
+		}
+		else if (bms_is_empty(rinfo->right_relids))
+		{
+			const_exprs = lappend(const_exprs, get_leftop(rinfo->clause));
+		}
+		else
+			continue;
+
+		expr_opfamilies = lappend(expr_opfamilies, rinfo->mergeopfamilies);
+	}
+
+	foreach(lc, matched_uniq_indexes)
+	{
+		bool	multi_nullvals, useful;
+		List	*exprs = get_exprs_from_uniqueindex(lfirst_node(IndexOptInfo, lc),
+													const_exprs,
+													expr_opfamilies,
+													used_attrs,
+													&useful,
+													&multi_nullvals);
+		if (useful)
+		{
+			if (exprs == NIL)
+			{
+				/* All the columns in Unique Index matched with a restrictinfo */
+				add_uniquekey_for_onerow(baserel);
+				return;
+			}
+			baserel->uniquekeys = lappend(baserel->uniquekeys,
+										  makeUniqueKey(exprs, multi_nullvals));
+		}
+	}
+}
+
+
+/*
+ * populate_partitionedrel_uniquekeys
+ * The UniqueKey on a partitioned rel comes from 2 cases:
+ * 1). Only one partition is involved in this query; the unique key can be
+ * copied to the parent rel from the childrel.
+ * 2). There is some unique index which includes the partition key and exists
+ * in all the related partitions.
+ * We don't bother with rule 2 if we hit rule 1.
+ */
+
+void
+populate_partitionedrel_uniquekeys(PlannerInfo *root,
+								   RelOptInfo *rel,
+								   List *childrels)
+{
+	ListCell	*lc;
+	List	*global_uniq_indexlist = NIL;
+	RelOptInfo *childrel;
+	bool is_first = true;
+
+	Assert(IS_PARTITIONED_REL(rel));
+
+	if (childrels == NIL)
+		return;
+
+	/*
+	 * If there is only one partition used in this query, the UniqueKey of the
+	 * childrel is still valid at the parent level, but we need to convert the
+	 * format from child exprs to parent exprs.
+	 */
+	if (list_length(childrels) == 1)
+	{
+		/* Check for Rule 1 */
+		RelOptInfo *childrel = linitial_node(RelOptInfo, childrels);
+		ListCell	*lc;
+		Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
+		if (relation_is_onerow(childrel))
+		{
+			add_uniquekey_for_onerow(rel);
+			return;
+		}
+
+		foreach(lc, childrel->uniquekeys)
+		{
+			UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+			AppendRelInfo *appinfo = find_appinfo_by_child(root, childrel->relid);
+			List *parent_exprs = NIL;
+			bool can_reuse = true;
+			ListCell	*lc2;
+			foreach(lc2, ukey->exprs)
+			{
+				Var *var = (Var *)lfirst(lc2);
+				/*
+				 * If the expr comes from an expression (not a plain Var), it is hard to
+				 * build the matching expression on the parent, so ignore that case for now.
+				 */
+				if(!IsA(var, Var))
+				{
+					can_reuse = false;
+					break;
+				}
+				/* Convert it to parent var */
+				parent_exprs = lappend(parent_exprs, find_parent_var(appinfo, var));
+			}
+			if (can_reuse)
+				rel->uniquekeys = lappend(rel->uniquekeys,
+										  makeUniqueKey(parent_exprs,
+														ukey->multi_nullvals));
+		}
+	}
+	else
+	{
+		/* Check for rule 2 */
+		childrel = linitial_node(RelOptInfo, childrels);
+		foreach(lc, childrel->indexlist)
+		{
+			IndexOptInfo *ind = lfirst(lc);
+			IndexOptInfo *modified_index;
+			if (!ind->unique || !ind->immediate ||
+				(ind->indpred != NIL && !ind->predOK))
+				continue;
+
+			/*
+			 * During simple_copy_indexinfo_to_parent, we need to convert each child
+			 * var to a parent var; an index on an expression is too complex to
+			 * handle, so ignore it for now.
+			 */
+			if (ind->indexprs != NIL)
+				continue;
+
+			modified_index = simple_copy_indexinfo_to_parent(root, rel, ind);
+			/*
+			 * If the unique index doesn't contain the partkey, then it is unique
+			 * on this partition only, so it is useless for us.
+			 */
+			if (!index_constains_partkey(rel, modified_index))
+				continue;
+
+			global_uniq_indexlist = lappend(global_uniq_indexlist,  modified_index);
+		}
+
+		if (global_uniq_indexlist != NIL)
+		{
+			foreach(lc, childrels)
+			{
+				RelOptInfo *child = lfirst(lc);
+				if (is_first)
+				{
+					is_first = false;
+					continue;
+				}
+				adjust_partition_unique_indexlist(root, rel, child, &global_uniq_indexlist);
+			}
+			/* Now we have a list of unique indexes which are exactly the same on all
+			 * childrels; set the UniqueKey just as for a non-partitioned table.
+			 */
+			populate_baserel_uniquekeys(root, rel, global_uniq_indexlist);
+		}
+	}
+}
+
+
+/*
+ * populate_distinctrel_uniquekeys
+ */
+void
+populate_distinctrel_uniquekeys(PlannerInfo *root,
+								RelOptInfo *inputrel,
+								RelOptInfo *distinctrel)
+{
+	/* The unique key before the distinct is still valid. */
+	distinctrel->uniquekeys = list_copy(inputrel->uniquekeys);
+	add_uniquekey_from_sortgroups(root, distinctrel, root->parse->distinctClause);
+}
+
+/*
+ * populate_grouprel_uniquekeys
+ */
+void
+populate_grouprel_uniquekeys(PlannerInfo *root,
+							 RelOptInfo *grouprel,
+							 RelOptInfo *inputrel)
+
+{
+	Query *parse = root->parse;
+	bool input_ukey_added = false;
+	ListCell *lc;
+
+	if (relation_is_onerow(inputrel))
+	{
+		add_uniquekey_for_onerow(grouprel);
+		return;
+	}
+	if (parse->groupingSets)
+		return;
+
+	/* A normal GROUP BY without grouping sets. */
+	if (parse->groupClause)
+	{
+		/*
+		 * Currently, even if the group by clause is unique already, we still have to
+		 * create the grouprel if the query has an aggref. To keep the UniqueKey short,
+		 * we check whether the UniqueKey of the input_rel is still valid; if so, we reuse it.
+		 */
+		foreach(lc, inputrel->uniquekeys)
+		{
+			UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+			if (list_is_subset(ukey->exprs, grouprel->reltarget->exprs))
+			{
+				grouprel->uniquekeys = lappend(grouprel->uniquekeys,
+											   ukey);
+				input_ukey_added = true;
+			}
+		}
+		if (!input_ukey_added)
+			/*
+			 * The group by clause must be a super-set of grouprel->reltarget->exprs except
+			 * for the aggregation exprs, so if such exprs are unique already, don't bother
+			 * generating a new uniquekey for the group by exprs.
+			 */
+			add_uniquekey_from_sortgroups(root,
+										  grouprel,
+										  root->parse->groupClause);
+	}
+	else
+		/* It has aggregation but without a group by, so only one row returned */
+		add_uniquekey_for_onerow(grouprel);
+}
+
+/*
+ * simple_copy_uniquekeys
+ * Using a function for this one-liner makes it easy to check where we simply
+ * copied the uniquekeys.
+ */
+void
+simple_copy_uniquekeys(RelOptInfo *oldrel,
+					   RelOptInfo *newrel)
+{
+	newrel->uniquekeys = oldrel->uniquekeys;
+}
+
+/*
+ *  populate_unionrel_uniquekeys
+ */
+void
+populate_unionrel_uniquekeys(PlannerInfo *root,
+							  RelOptInfo *unionrel)
+{
+	ListCell	*lc;
+	List	*exprs = NIL;
+
+	Assert(unionrel->uniquekeys == NIL);
+
+	foreach(lc, unionrel->reltarget->exprs)
+	{
+		exprs = lappend(exprs, lfirst(lc));
+	}
+
+	if (exprs == NIL)
+		/* SQL: select union select; is valid, we need to handle it here. */
+		add_uniquekey_for_onerow(unionrel);
+	else
+		unionrel->uniquekeys = lappend(unionrel->uniquekeys,
+									   makeUniqueKey(exprs,false));
+
+}
+
+/*
+ * populate_joinrel_uniquekeys
+ *
+ * Populate uniquekeys for the joinrel. We check each relation to see if its
+ * UniqueKey is still valid via innerrel_keeps_unique; if so, we add it to the
+ * joinrel.  The multi_nullvals field is changed to true for some outer
+ * join cases, and a one-row UniqueKey needs to be converted to a normal UniqueKey
+ * for the same cases as well.
+ * For a uniquekey in either baserel which can't stay unique after the join, we
+ * still check whether the combination of UniqueKeys from both sides is useful
+ * for us; if yes, we add it to the joinrel as well.
+ */
+void
+populate_joinrel_uniquekeys(PlannerInfo *root, RelOptInfo *joinrel,
+							RelOptInfo *outerrel, RelOptInfo *innerrel,
+							List *restrictlist, JoinType jointype)
+{
+	ListCell *lc, *lc2;
+	List	*clause_list = NIL;
+	List	*outerrel_ukey_ctx;
+	List	*innerrel_ukey_ctx;
+	bool	inner_onerow, outer_onerow;
+	bool	mergejoin_allowed;
+
+	/* Care about the outerrel only for SEMI/ANTI joins */
+	if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+	{
+		foreach(lc, outerrel->uniquekeys)
+		{
+			UniqueKey	*uniquekey = lfirst_node(UniqueKey, lc);
+			if (list_is_subset(uniquekey->exprs, joinrel->reltarget->exprs))
+				joinrel->uniquekeys = lappend(joinrel->uniquekeys, uniquekey);
+		}
+		return;
+	}
+
+	Assert(jointype == JOIN_LEFT || jointype == JOIN_FULL || jointype == JOIN_INNER);
+
+	/* Fast path */
+	if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+		return;
+
+	inner_onerow = relation_is_onerow(innerrel);
+	outer_onerow = relation_is_onerow(outerrel);
+
+	outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(outerrel);
+	innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(innerrel);
+
+	clause_list = select_mergejoin_clauses(root, joinrel, outerrel, innerrel,
+										   restrictlist, jointype,
+										   &mergejoin_allowed);
+
+	if (innerrel_keeps_unique(root, innerrel, outerrel, clause_list, true /* reverse */))
+	{
+		bool outer_impact = jointype == JOIN_FULL;
+		foreach(lc, outerrel_ukey_ctx)
+		{
+			UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+			if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+			{
+				ctx->useful = false;
+				continue;
+			}
+
+			/* The outer relation has one row and the unique key is not duplicated after
+			 * the join, so the joinrel will still have one row unless jointype == JOIN_FULL.
+			 */
+			if (outer_onerow && !outer_impact)
+			{
+				add_uniquekey_for_onerow(joinrel);
+				return;
+			}
+			else if (outer_onerow)
+			{
+				/*
+				 * The onerow outerrel becomes multi rows and multi_nullvals
+				 * will be changed to true. We also need to set the exprs correctly since it
+				 * can't be NIL any more.
+				 */
+				ListCell *lc2;
+				foreach(lc2, get_exprs_from_uniquekey(joinrel, outerrel, NULL))
+				{
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(lfirst(lc2), true));
+				}
+			}
+			else
+			{
+				if (!ctx->uniquekey->multi_nullvals && outer_impact)
+					/* Change multi_nullvals to true due to the full join. */
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(ctx->uniquekey->exprs, true));
+				else
+					/* Just reuse it */
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  ctx->uniquekey);
+			}
+			ctx->added_to_joinrel = true;
+		}
+	}
+
+	if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+	{
+		bool outer_impact = jointype == JOIN_FULL || jointype == JOIN_LEFT;
+
+		foreach(lc, innerrel_ukey_ctx)
+		{
+			UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+			if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+			{
+				ctx->useful = false;
+				continue;
+			}
+
+			if (inner_onerow &&  !outer_impact)
+			{
+				add_uniquekey_for_onerow(joinrel);
+				return;
+			}
+			else if (inner_onerow)
+			{
+				ListCell *lc2;
+				foreach(lc2, get_exprs_from_uniquekey(joinrel, innerrel, NULL))
+				{
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(lfirst(lc2), true));
+				}
+			}
+			else
+			{
+				if (!ctx->uniquekey->multi_nullvals && outer_impact)
+					/* Need to change multi_nullvals to true due to the outer join. */
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(ctx->uniquekey->exprs,
+																true));
+				else
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  ctx->uniquekey);
+
+			}
+			ctx->added_to_joinrel = true;
+		}
+	}
+
+	/*
+	 * The combination of the UniqueKeys from both sides is unique as well regardless
+	 * of the join type, but don't bother adding it if a subset of it has already been
+	 * added to the joinrel, or if it is not useful for the joinrel.
+	 */
+	foreach(lc, outerrel_ukey_ctx)
+	{
+		UniqueKeyContext ctx1 = (UniqueKeyContext) lfirst(lc);
+		if (ctx1->added_to_joinrel || !ctx1->useful)
+			continue;
+		foreach(lc2, innerrel_ukey_ctx)
+		{
+			UniqueKeyContext ctx2 = (UniqueKeyContext) lfirst(lc2);
+			if (ctx2->added_to_joinrel || !ctx2->useful)
+				continue;
+			if (add_combined_uniquekey(joinrel, outerrel, innerrel,
+									   ctx1->uniquekey, ctx2->uniquekey,
+									   jointype))
+				/* If we set a onerow UniqueKey on the joinrel, we don't need the others. */
+				return;
+		}
+	}
+}
+
+
+/*
+ * convert_subquery_uniquekeys
+ *
+ * Convert the UniqueKeys of the subquery to the outer relation.
+ */
+void convert_subquery_uniquekeys(PlannerInfo *root,
+								 RelOptInfo *currel,
+								 RelOptInfo *sub_final_rel)
+{
+	ListCell	*lc;
+
+	if (sub_final_rel->uniquekeys == NIL)
+		return;
+
+	if (relation_is_onerow(sub_final_rel))
+	{
+		add_uniquekey_for_onerow(currel);
+		return;
+	}
+
+	Assert(currel->subroot != NULL);
+
+	foreach(lc, sub_final_rel->uniquekeys)
+	{
+		UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+		ListCell	*lc;
+		List	*exprs = NIL;
+		bool	ukey_useful = true;
+
+		/* One row case is handled above */
+		Assert(ukey->exprs != NIL);
+		foreach(lc, ukey->exprs)
+		{
+			Var *var;
+			TargetEntry *tle = tlist_member(lfirst(lc),
+											currel->subroot->processed_tlist);
+			if (tle == NULL)
+			{
+				ukey_useful = false;
+				break;
+			}
+			var = find_var_for_subquery_tle(currel, tle);
+			if (var == NULL)
+			{
+				ukey_useful = false;
+				break;
+			}
+			exprs = lappend(exprs, var);
+		}
+
+		if (ukey_useful)
+			currel->uniquekeys = lappend(currel->uniquekeys,
+										 makeUniqueKey(exprs,
+													   ukey->multi_nullvals));
+
+	}
+}
+
+/*
+ * innerrel_keeps_unique
+ *
+ * Check if a unique key of the innerrel is still valid after the join. The
+ * innerrel's UniqueKey will still be valid if an "innerrel.any_col mergeop
+ * outerrel.uniquekey" clause exists in clause_list.
+ *
+ * Note: the clause_list must be a list of mergeable restrictinfo already.
+ */
+static bool
+innerrel_keeps_unique(PlannerInfo *root,
+					  RelOptInfo *outerrel,
+					  RelOptInfo *innerrel,
+					  List *clause_list,
+					  bool reverse)
+{
+	ListCell	*lc, *lc2, *lc3;
+
+	if (outerrel->uniquekeys == NIL || innerrel->uniquekeys == NIL)
+		return false;
+
+	/* Check if there is outerrel's uniquekey in mergeable clause. */
+	foreach(lc, outerrel->uniquekeys)
+	{
+		List	*outer_uq_exprs = lfirst_node(UniqueKey, lc)->exprs;
+		bool clauselist_matchs_all_exprs = true;
+		foreach(lc2, outer_uq_exprs)
+		{
+			Node *outer_uq_expr = lfirst(lc2);
+			bool find_uq_expr_in_clauselist = false;
+			foreach(lc3, clause_list)
+			{
+				RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc3);
+				Node *outer_expr;
+				if (reverse)
+					outer_expr = rinfo->outer_is_left ? get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+				else
+					outer_expr = rinfo->outer_is_left ? get_leftop(rinfo->clause) : get_rightop(rinfo->clause);
+				if (equal(outer_expr, outer_uq_expr))
+				{
+					find_uq_expr_in_clauselist = true;
+					break;
+				}
+			}
+			if (!find_uq_expr_in_clauselist)
+			{
+				/* No need to check the next exprs in the current uniquekey */
+				clauselist_matchs_all_exprs = false;
+				break;
+			}
+		}
+
+		if (clauselist_matchs_all_exprs)
+			return true;
+	}
+	return false;
+}
+
+
+/*
+ * relation_is_onerow
+ * Check if it is a one-row relation by checking UniqueKey.
+ */
+bool
+relation_is_onerow(RelOptInfo *rel)
+{
+	UniqueKey *ukey;
+	if (rel->uniquekeys == NIL)
+		return false;
+	ukey = linitial_node(UniqueKey, rel->uniquekeys);
+	return ukey->exprs == NIL && list_length(rel->uniquekeys) == 1;
+}
+
+/*
+ * relation_has_uniquekeys_for
+ *		Returns true if we have proofs that 'rel' cannot return multiple rows with
+ *		the same values in each of 'exprs'.  Otherwise returns false.
+ */
+bool
+relation_has_uniquekeys_for(PlannerInfo *root, RelOptInfo *rel,
+							List *exprs, bool allow_multinulls)
+{
+	ListCell *lc;
+
+	/*
+	 * For the UniqueKey onerow case, the uniquekey->exprs is empty as well,
+	 * so we can't rely on list_is_subset to handle this special case.
+	 */
+	if (exprs == NIL)
+		return false;
+
+	foreach(lc, rel->uniquekeys)
+	{
+		UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+		if (ukey->multi_nullvals && !allow_multinulls)
+			continue;
+		if (list_is_subset(ukey->exprs, exprs))
+			return true;
+	}
+	return false;
+}
+
+
+/*
+ * get_exprs_from_uniqueindex
+ *
+ * Return a list of exprs which are unique. Set *useful to false if this
+ * unique index is not useful for us.
+ */
+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+						   List *const_exprs,
+						   List *const_expr_opfamilies,
+						   Bitmapset *used_varattrs,
+						   bool *useful,
+						   bool *multi_nullvals)
+{
+	List	*exprs = NIL;
+	ListCell	*indexpr_item;
+	int	c = 0;
+
+	*useful = true;
+	*multi_nullvals = false;
+
+	indexpr_item = list_head(unique_index->indexprs);
+	for(c = 0; c < unique_index->ncolumns; c++)
+	{
+		int attr = unique_index->indexkeys[c];
+		Expr *expr;
+		bool	matched_const = false;
+		ListCell	*lc1, *lc2;
+
+		if(attr > 0)
+		{
+			expr = list_nth_node(TargetEntry, unique_index->indextlist, c)->expr;
+		}
+		else if (attr == 0)
+		{
+			/* Expression index */
+			expr = lfirst(indexpr_item);
+			indexpr_item = lnext(unique_index->indexprs, indexpr_item);
+		}
+		else /* attr < 0 */
+		{
+			/* Index on system column is not supported */
+			Assert(false);
+		}
+
+		/*
+		 * Check index_col = Const case with regarding to opfamily checking
+		 * If we can remove the index_col from the final UniqueKey->exprs.
+		 */
+		forboth(lc1, const_exprs, lc2, const_expr_opfamilies)
+		{
+			if (list_member_oid((List *)lfirst(lc2), unique_index->opfamily[c])
+				&& match_index_to_operand((Node *) lfirst(lc1), c, unique_index))
+			{
+				matched_const = true;
+				break;
+			}
+		}
+
+		if (matched_const)
+			continue;
+
+		/* Check if the indexed expr is used in rel */
+		if (attr > 0)
+		{
+			 * Normal indexed column; if the col is not used, then the index is useless
+			 * for a uniquekey.
+			 * for uniquekey.
+			 */
+			attr -= FirstLowInvalidHeapAttributeNumber;
+
+			if (!bms_is_member(attr, used_varattrs))
+			{
+				*useful = false;
+				break;
+			}
+		}
+		else if (!list_member(unique_index->rel->reltarget->exprs, expr))
+		{
+			/* Expression index but the expression is not used in rel */
+			*useful = false;
+			break;
+		}
+
+		/* check not null property. */
+		if (attr == 0)
+		{
+			/* We never know if an expression yields null or not */
+			*multi_nullvals = true;
+		}
+		else if (!bms_is_member(attr, unique_index->rel->notnullattrs)
+				 && !bms_is_member(0 - FirstLowInvalidHeapAttributeNumber,
+								   unique_index->rel->notnullattrs))
+		{
+			*multi_nullvals = true;
+		}
+
+		exprs = lappend(exprs, expr);
+	}
+	return exprs;
+}
+
+
+/*
+ * add_uniquekey_for_onerow
+ * If we are sure that the relation only returns one row, then all the columns
+ * are unique. However we don't need to create a UniqueKey for every column; we
+ * just set exprs = NIL and overwrite all the other UniqueKeys on this RelOptInfo,
+ * since this one has the strongest semantics.
+ */
+void
+add_uniquekey_for_onerow(RelOptInfo *rel)
+{
+	/*
+	 * We overwrite the previous UniqueKey on purpose since this one has the
+	 * strongest semantic.
+	 */
+	rel->uniquekeys = list_make1(makeUniqueKey(NIL, false));
+}
+
+
+/*
+ * initililze_uniquecontext_for_joinrel
+ * Return a List of UniqueKeyContext for an inputrel
+ */
+static List *
+initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel)
+{
+	List	*res = NIL;
+	ListCell *lc;
+	foreach(lc,  inputrel->uniquekeys)
+	{
+		UniqueKeyContext context;
+		context = palloc(sizeof(struct UniqueKeyContextData));
+		context->uniquekey = lfirst_node(UniqueKey, lc);
+		context->added_to_joinrel = false;
+		context->useful = true;
+		res = lappend(res, context);
+	}
+	return res;
+}
+
+
+/*
+ * get_exprs_from_uniquekey
+ *	Unify the way of getting a List of exprs from a one-row UniqueKey or a
+ * normal UniqueKey. For the onerow case, every expr in rel1 is a valid
+ * UniqueKey. Return a List of expr lists.
+ *
+ * rel1: The relation which you want to get the exprs.
+ * ukey: The UniqueKey you want to get the exprs.
+ */
+static List *
+get_exprs_from_uniquekey(RelOptInfo *joinrel, RelOptInfo *rel1, UniqueKey *ukey)
+{
+	ListCell *lc;
+	bool onerow = rel1 != NULL && relation_is_onerow(rel1);
+
+	List	*res = NIL;
+	Assert(onerow || ukey);
+	if (onerow)
+	{
+		/* Only care about the exprs which still exist in the joinrel */
+		foreach(lc, joinrel->reltarget->exprs)
+		{
+			Bitmapset *relids = pull_varnos(lfirst(lc));
+			if (bms_is_subset(relids, rel1->relids))
+			{
+				res = lappend(res, list_make1(lfirst(lc)));
+			}
+		}
+	}
+	else
+	{
+		res = list_make1(ukey->exprs);
+	}
+	return res;
+}
+
+/*
+ * Partitioned table Unique Keys.
+ * The partitioned table's unique key is maintained as follows:
+ * 1. The index must be unique as usual.
+ * 2. The index must contain the partition key.
+ * 3. The index must exist on all the child rels. See simple_indexinfo_equal for
+ *    how we compare them.
+ */
+
+/*
+ * index_constains_partkey
+ * Return true if the index contains the partition key.
+ */
+static bool
+index_constains_partkey(RelOptInfo *partrel,  IndexOptInfo *ind)
+{
+	ListCell	*lc;
+	int	i;
+	Assert(IS_PARTITIONED_REL(partrel));
+	Assert(partrel->part_scheme->partnatts > 0);
+
+	for(i = 0; i < partrel->part_scheme->partnatts; i++)
+	{
+		Node *part_expr = linitial(partrel->partexprs[i]);
+		bool found_in_index = false;
+		foreach(lc, ind->indextlist)
+		{
+			Expr *index_expr = lfirst_node(TargetEntry, lc)->expr;
+			if (equal(index_expr, part_expr))
+			{
+				found_in_index = true;
+				break;
+			}
+		}
+		if (!found_in_index)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * simple_indexinfo_equal
+ *
+ * Used to check if the 2 indexes are the same as each other. The indexes here
+ * are COPIED from the childrel with some tiny changes (see
+ * simple_copy_indexinfo_to_parent).
+ */
+static bool
+simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2)
+{
+	Size oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+	return ind1->ncolumns == ind2->ncolumns &&
+		ind1->unique == ind2->unique &&
+		memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+		memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+		memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+		memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+		equal(get_tlist_exprs(ind1->indextlist, true),
+			  get_tlist_exprs(ind2->indextlist, true));
+}
+
+
+/*
+ * The below macros are used for simple_copy_indexinfo_to_parent, which is so
+ * customized that I don't want to put it in copyfuncs.c. So copy them here.
+ */
+#define COPY_POINTER_FIELD(fldname, sz) \
+	do { \
+		Size	_size = (sz); \
+		newnode->fldname = palloc(_size); \
+		memcpy(newnode->fldname, from->fldname, _size); \
+	} while (0)
+
+#define COPY_NODE_FIELD(fldname) \
+	(newnode->fldname = copyObjectImpl(from->fldname))
+
+#define COPY_SCALAR_FIELD(fldname) \
+	(newnode->fldname = from->fldname)
+
+
+/*
+ * simple_copy_indexinfo_to_parent (from partition)
+ * Copy the IndexOptInfo from a child relation to the parent relation with some
+ * modifications, which is used to test:
+ * 1. If the same index exists in all the childrels.
+ * 2. If the parentrel->reltarget/basicrestrict info matches this index.
+ */
+static IndexOptInfo *
+simple_copy_indexinfo_to_parent(PlannerInfo *root,
+								RelOptInfo *parentrel,
+								IndexOptInfo *from)
+{
+	IndexOptInfo *newnode = makeNode(IndexOptInfo);
+	AppendRelInfo *appinfo = find_appinfo_by_child(root, from->rel->relid);
+	ListCell	*lc;
+	int	idx = 0;
+
+	COPY_SCALAR_FIELD(ncolumns);
+	COPY_SCALAR_FIELD(nkeycolumns);
+	COPY_SCALAR_FIELD(unique);
+	COPY_SCALAR_FIELD(immediate);
+	/* We just need to know if it is NIL or not */
+	COPY_SCALAR_FIELD(indpred);
+	COPY_SCALAR_FIELD(predOK);
+	COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+	COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+	COPY_NODE_FIELD(indextlist);
+
+	/* Convert index exprs on child expr to expr on parent */
+	foreach(lc, newnode->indextlist)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		/* Index on expression is ignored */
+		Assert(IsA(tle->expr, Var));
+		tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+		newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+		idx++;
+	}
+	newnode->rel = parentrel;
+	return newnode;
+}
+
+/*
+ * adjust_partition_unique_indexlist
+ *
+ * global_unique_indexes: at the beginning, it contains the copied & modified
+ * unique indexes from the first partition. Then we check whether each index in it
+ * still exists in the following partitions; if not, we remove it. At the end, it
+ * holds an index list which exists in all the partitions.
+ */
+static void
+adjust_partition_unique_indexlist(PlannerInfo *root,
+								  RelOptInfo *parentrel,
+								  RelOptInfo *childrel,
+								  List **global_unique_indexes)
+{
+	ListCell	*lc, *lc2;
+	foreach(lc, *global_unique_indexes)
+	{
+		IndexOptInfo	*g_ind = lfirst_node(IndexOptInfo, lc);
+		bool found_in_child = false;
+
+		foreach(lc2, childrel->indexlist)
+		{
+			IndexOptInfo   *p_ind = lfirst_node(IndexOptInfo, lc2);
+			IndexOptInfo   *p_ind_copy;
+			if (!p_ind->unique || !p_ind->immediate ||
+				(p_ind->indpred != NIL && !p_ind->predOK))
+				continue;
+			p_ind_copy = simple_copy_indexinfo_to_parent(root, parentrel, p_ind);
+			if (simple_indexinfo_equal(p_ind_copy, g_ind))
+			{
+				found_in_child = true;
+				break;
+			}
+		}
+		if (!found_in_child)
+			/* The index doesn't exist in childrel, remove it from global_unique_indexes */
+			*global_unique_indexes = foreach_delete_current(*global_unique_indexes, lc);
+	}
+}
+
+/* Helper function for groupres/distinctrel */
+static void
+add_uniquekey_from_sortgroups(PlannerInfo *root, RelOptInfo *rel, List *sortgroups)
+{
+	Query *parse = root->parse;
+	List	*exprs;
+
+	/*
+	 * XXX: If there are some vars which are not at the current query level, the
+	 * semantics are imprecise; should we avoid this or not? levelsup = 1 is just a
+	 * demo; maybe we need to check every level other than 0, and if so, it looks
+	 * like we have to write another pull_var_walker.
+	 */
+	List	*upper_vars = pull_vars_of_level((Node*)sortgroups, 1);
+
+	if (upper_vars != NIL)
+		return;
+
+	exprs = get_sortgrouplist_exprs(sortgroups, parse->targetList);
+	rel->uniquekeys = lappend(rel->uniquekeys,
+							  makeUniqueKey(exprs,
+											false /* sortgroupclause can't be multi_nullvals */));
+}
+
+
+/*
+ * add_combined_uniquekey
+ * The combination of both UniqueKeys is a valid UniqueKey for the joinrel
+ * regardless of the jointype.
+ */
+bool
+add_combined_uniquekey(RelOptInfo *joinrel,
+					   RelOptInfo *outer_rel,
+					   RelOptInfo *inner_rel,
+					   UniqueKey *outer_ukey,
+					   UniqueKey *inner_ukey,
+					   JoinType jointype)
+{
+
+	ListCell	*lc1, *lc2;
+
+	/* If either side has multi_nullvals or we have an outer join,
+	 * the combined UniqueKey has multi_nullvals. */
+	bool multi_nullvals = outer_ukey->multi_nullvals ||
+		inner_ukey->multi_nullvals || IS_OUTER_JOIN(jointype);
+
+	/* The only case where we can get a onerow joinrel after the join */
+	if  (relation_is_onerow(outer_rel)
+		 && relation_is_onerow(inner_rel)
+		 && jointype == JOIN_INNER)
+	{
+		add_uniquekey_for_onerow(joinrel);
+		return true;
+	}
+
+	foreach(lc1, get_exprs_from_uniquekey(joinrel, outer_rel, outer_ukey))
+	{
+		foreach(lc2, get_exprs_from_uniquekey(joinrel, inner_rel, inner_ukey))
+		{
+			List *exprs = list_concat_copy(lfirst_node(List, lc1), lfirst_node(List, lc2));
+			joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+										  makeUniqueKey(exprs,
+														multi_nullvals));
+		}
+	}
+	return false;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b406d41e91..0551ae0512 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2389,6 +2389,8 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 		add_path(final_rel, path);
 	}
 
+	simple_copy_uniquekeys(current_rel, final_rel);
+
 	/*
 	 * Generate partial paths for final_rel, too, if outer query levels might
 	 * be able to make use of them.
@@ -3899,6 +3901,8 @@ create_grouping_paths(PlannerInfo *root,
 	}
 
 	set_cheapest(grouped_rel);
+
+	populate_grouprel_uniquekeys(root, grouped_rel, input_rel);
 	return grouped_rel;
 }
 
@@ -4615,7 +4619,7 @@ create_window_paths(PlannerInfo *root,
 
 	/* Now choose the best path(s) */
 	set_cheapest(window_rel);
-
+	simple_copy_uniquekeys(input_rel, window_rel);
 	return window_rel;
 }
 
@@ -4911,7 +4915,7 @@ create_distinct_paths(PlannerInfo *root,
 
 	/* Now choose the best path(s) */
 	set_cheapest(distinct_rel);
-
+	populate_distinctrel_uniquekeys(root, input_rel, distinct_rel);
 	return distinct_rel;
 }
 
@@ -5172,6 +5176,8 @@ create_ordered_paths(PlannerInfo *root,
 	 */
 	Assert(ordered_rel->pathlist != NIL);
 
+	simple_copy_uniquekeys(input_rel, ordered_rel);
+
 	return ordered_rel;
 }
 
@@ -6049,6 +6055,9 @@ adjust_paths_for_srfs(PlannerInfo *root, RelOptInfo *rel,
 	if (list_length(targets) == 1)
 		return;
 
+	/* UniqueKey is not valid after handling the SRF. */
+	rel->uniquekeys = NIL;
+
 	/*
 	 * Stack SRF-evaluation nodes atop each path for the rel.
 	 *
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 951aed80e7..e94e92937c 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -689,6 +689,8 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/* Undo effects of possibly forcing tuple_fraction to 0 */
 	root->tuple_fraction = save_fraction;
 
+	/* Add the UniqueKeys */
+	populate_unionrel_uniquekeys(root, result_rel);
 	return result_rel;
 }
 
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index d722063cf3..44c37ecffc 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -746,3 +746,47 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
 	}
 	return appinfos;
 }
+
+/*
+ * find_appinfo_by_child
+ *		Find the AppendRelInfo whose child_relid equals the given index.
+ */
+AppendRelInfo *
+find_appinfo_by_child(PlannerInfo *root, Index child_index)
+{
+	ListCell	*lc;
+	foreach(lc, root->append_rel_list)
+	{
+		AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+		if (appinfo->child_relid == child_index)
+			return appinfo;
+	}
+	elog(ERROR, "parent relation cant be found");
+	return NULL;
+}
+
+/*
+ * find_parent_var
+ *		Translate a child Var to the corresponding parent Var via translated_vars.
+ */
+Var *
+find_parent_var(AppendRelInfo *appinfo, Var *child_var)
+{
+	ListCell	*lc;
+	Var	*res = NULL;
+	Index attno = 1;
+	foreach(lc, appinfo->translated_vars)
+	{
+		Node *child_node = lfirst(lc);
+		if (equal(child_node, child_var))
+		{
+			res = copyObject(child_var);
+			res->varattno = attno;
+			res->varno = appinfo->parent_relid;
+		}
+		attno++;
+	}
+	if (res == NULL)
+		elog(ERROR, "parent var can't be found.");
+	return res;
+}
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 3132fd35a5..d66b40ec50 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -736,6 +736,7 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
 		{
 			Node	   *onecq = (Node *) lfirst(lc2);
 			bool		pseudoconstant;
+			RestrictInfo	*child_rinfo;
 
 			/* check for pseudoconstant (no Vars or volatile functions) */
 			pseudoconstant =
@@ -747,13 +748,14 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
 				root->hasPseudoConstantQuals = true;
 			}
 			/* reconstitute RestrictInfo with appropriate properties */
-			childquals = lappend(childquals,
-								 make_restrictinfo((Expr *) onecq,
-												   rinfo->is_pushed_down,
-												   rinfo->outerjoin_delayed,
-												   pseudoconstant,
-												   rinfo->security_level,
-												   NULL, NULL, NULL));
+			child_rinfo =  make_restrictinfo((Expr *) onecq,
+											 rinfo->is_pushed_down,
+											 rinfo->outerjoin_delayed,
+											 pseudoconstant,
+											 rinfo->security_level,
+											 NULL, NULL, NULL);
+			child_rinfo->mergeopfamilies = rinfo->mergeopfamilies;
+			childquals = lappend(childquals, child_rinfo);
 			/* track minimum security level among child quals */
 			cq_min_security = Min(cq_min_security, rinfo->security_level);
 		}
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 31d9aedeeb..c83f17acb7 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -16,6 +16,7 @@
 
 #include "nodes/execnodes.h"
 #include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
 
 
 extern A_Expr *makeA_Expr(A_Expr_Kind kind, List *name,
@@ -105,4 +106,6 @@ extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int loc
 
 extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
 
+extern UniqueKey* makeUniqueKey(List *exprs, bool multi_nullvals);
+
 #endif							/* MAKEFUNC_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4..41110ed888 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -264,6 +264,7 @@ typedef enum NodeTag
 	T_EquivalenceMember,
 	T_PathKey,
 	T_PathTarget,
+	T_UniqueKey,
 	T_RestrictInfo,
 	T_IndexClause,
 	T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 9e3ebd488a..02e4458bef 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -730,6 +730,7 @@ typedef struct RelOptInfo
 	QualCost	baserestrictcost;	/* cost of evaluating the above */
 	Index		baserestrict_min_security;	/* min security_level found in
 											 * baserestrictinfo */
+	List	   *uniquekeys;		/* List of UniqueKey */
 	List	   *joininfo;		/* RestrictInfo structures for join clauses
 								 * involving this rel */
 	bool		has_eclass_joins;	/* T means joininfo is incomplete */
@@ -1047,6 +1048,28 @@ typedef struct PathKey
 } PathKey;
 
 
+/*
+ * UniqueKey
+ *
+ * Represents the unique properties held by a RelOptInfo.
+ *
+ * exprs is a list of exprs that are unique on the current RelOptInfo.
+ * exprs = NIL is a special case of UniqueKey, meaning there is only one row
+ * in that relation.
+ * multi_nullvals: true means multiple null values may exist in these exprs,
+ * so uniqueness is not guaranteed in that case. This field is needed for
+ * remove_useless_joins & reduce_unique_semijoins, where we don't mind such
+ * duplicate NULL values. It is set to true in two cases: a unique key built
+ * on a unique index whose column is nullable, and outer joins. See
+ * populate_joinrel_uniquekeys for details.
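+ * For example, a unique index on a nullable column a yields the UniqueKey
+ * (a) with multi_nullvals = true, since the index still permits any number
+ * of rows with a = NULL.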
+ */
+typedef struct UniqueKey
+{
+	NodeTag		type;
+	List	   *exprs;
+	bool		multi_nullvals;
+} UniqueKey;
+
 /*
  * PathTarget
  *
@@ -2473,7 +2496,7 @@ typedef enum
  *
  * flags indicating what kinds of grouping are possible.
  * partial_costs_set is true if the agg_partial_costs and agg_final_costs
- * 		have been initialized.
+ *		have been initialized.
  * agg_partial_costs gives partial aggregation costs.
  * agg_final_costs gives finalization costs.
  * target_parallel_safe is true if target is parallel safe.
@@ -2503,8 +2526,8 @@ typedef struct
  * limit_tuples is an estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate.
  * count_est and offset_est are the estimated values of the LIMIT and OFFSET
- * 		expressions computed by preprocess_limit() (see comments for
- * 		preprocess_limit() for more information).
+ *		expressions computed by preprocess_limit() (see comments for
+ *		preprocess_limit() for more information).
  */
 typedef struct
 {
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 14ea2766ad..621f54a9f8 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -528,6 +528,8 @@ extern bool list_member_ptr(const List *list, const void *datum);
 extern bool list_member_int(const List *list, int datum);
 extern bool list_member_oid(const List *list, Oid datum);
 
+extern bool list_is_subset(const List *members, const List *target);
+
 extern List *list_delete(List *list, void *datum);
 extern List *list_delete_ptr(List *list, void *datum);
 extern List *list_delete_int(List *list, int datum);
diff --git a/src/include/optimizer/appendinfo.h b/src/include/optimizer/appendinfo.h
index d6a27a60dd..e87c92a054 100644
--- a/src/include/optimizer/appendinfo.h
+++ b/src/include/optimizer/appendinfo.h
@@ -32,4 +32,7 @@ extern Relids adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 extern AppendRelInfo **find_appinfos_by_relids(PlannerInfo *root,
 											   Relids relids, int *nappinfos);
 
+extern AppendRelInfo *find_appinfo_by_child(PlannerInfo *root, Index child_index);
+extern Var *find_parent_var(AppendRelInfo *appinfo, Var *child_var);
+
 #endif							/* APPENDINFO_H */
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 3e4171056e..9445141263 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -23,6 +23,7 @@
 #define OPTIMIZER_H
 
 #include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
 
 /*
  * We don't want to include nodes/pathnodes.h here, because non-planner
@@ -156,6 +157,7 @@ extern TargetEntry *get_sortgroupref_tle(Index sortref,
 										 List *targetList);
 extern TargetEntry *get_sortgroupclause_tle(SortGroupClause *sgClause,
 											List *targetList);
+extern Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
 extern Node *get_sortgroupclause_expr(SortGroupClause *sgClause,
 									  List *targetList);
 extern List *get_sortgrouplist_exprs(List *sgClauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..9217a8d6c6 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -240,5 +240,48 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 									   int strategy, bool nulls_first);
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 									List *live_childrels);
+extern List *select_mergejoin_clauses(PlannerInfo *root,
+									  RelOptInfo *joinrel,
+									  RelOptInfo *outerrel,
+									  RelOptInfo *innerrel,
+									  List *restrictlist,
+									  JoinType jointype,
+									  bool *mergejoin_allowed);
+
+/*
+ * uniquekeys.c
+ *	  Utilities for matching and building unique keys
+ */
+extern void populate_baserel_uniquekeys(PlannerInfo *root,
+										RelOptInfo *baserel,
+										List* unique_index_list);
+extern void populate_partitionedrel_uniquekeys(PlannerInfo *root,
+												RelOptInfo *rel,
+												List *childrels);
+extern void populate_distinctrel_uniquekeys(PlannerInfo *root,
+											RelOptInfo *inputrel,
+											RelOptInfo *distinctrel);
+extern void populate_grouprel_uniquekeys(PlannerInfo *root,
+										 RelOptInfo *grouprel,
+										 RelOptInfo *inputrel);
+extern void populate_unionrel_uniquekeys(PlannerInfo *root,
+										  RelOptInfo *unionrel);
+extern void simple_copy_uniquekeys(RelOptInfo *oldrel,
+								   RelOptInfo *newrel);
+extern void convert_subquery_uniquekeys(PlannerInfo *root,
+										RelOptInfo *currel,
+										RelOptInfo *sub_final_rel);
+extern void populate_joinrel_uniquekeys(PlannerInfo *root,
+										RelOptInfo *joinrel,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										List *restrictlist,
+										JoinType jointype);
+
+extern bool relation_has_uniquekeys_for(PlannerInfo *root,
+										RelOptInfo *rel,
+										List *exprs,
+										bool allow_multinulls);
+extern bool relation_is_onerow(RelOptInfo *rel);
 
 #endif							/* PATHS_H */
-- 
2.27.0

v01-0001-Extend-UniqueKeys.patchapplication/octet-stream; name=v01-0001-Extend-UniqueKeys.patchDownload
From d076b1633e56007bf63456053470daf441d850d0 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 8 Jun 2020 20:33:56 +0200
Subject: [PATCH 3/5] Extend UniqueKeys

Prepares the index skip scan implementation using UniqueKeys. Allows
specifying which keys are requested to be unique, and adds them to the
necessary Paths to make them useful later.

Proposed by David Rowley; contains a few bits from a previous version by
Jesper Pedersen.
---
 src/backend/optimizer/path/pathkeys.c   | 59 +++++++++++++++++++++++
 src/backend/optimizer/path/uniquekeys.c | 63 +++++++++++++++++++++++++
 src/backend/optimizer/plan/planner.c    | 36 +++++++++++++-
 src/backend/optimizer/util/pathnode.c   | 32 +++++++++----
 src/include/nodes/pathnodes.h           |  5 ++
 src/include/optimizer/pathnode.h        |  1 +
 src/include/optimizer/paths.h           |  8 ++++
 7 files changed, 194 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 7e596d4194..a4fc4f252d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
 #include "utils/lsyscache.h"
 
 
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 											 RelOptInfo *partrel,
@@ -95,6 +96,29 @@ make_canonical_pathkey(PlannerInfo *root,
 	return pk;
 }
 
+/*
+ * pathkey_is_unique
+ *	   Return true if the new pathkey's equivalence class does not match
+ *	   that of any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+	EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+	ListCell   *lc;
+
+	/* If the same EC is already in the list, then not unique */
+	foreach(lc, pathkeys)
+	{
+		PathKey    *old_pathkey = (PathKey *) lfirst(lc);
+
+		if (new_ec == old_pathkey->pk_eclass)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * pathkey_is_redundant
  *	   Is a pathkey redundant with one already in the given list?
@@ -1151,6 +1175,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
 	return pathkeys;
 }
 
+/*
+ * make_pathkeys_for_uniquekeys
+ *		Generate a pathkeys list to be used for unique keys
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+							 List *sortclauses,
+							 List *tlist)
+{
+	List	   *pathkeys = NIL;
+	ListCell   *l;
+
+	foreach(l, sortclauses)
+	{
+		SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+		Expr	   *sortkey;
+		PathKey    *pathkey;
+
+		sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+		Assert(OidIsValid(sortcl->sortop));
+		pathkey = make_pathkey_from_sortop(root,
+										   sortkey,
+										   root->nullable_baserels,
+										   sortcl->sortop,
+										   sortcl->nulls_first,
+										   sortcl->tleSortGroupRef,
+										   true);
+
+		if (pathkey_is_unique(pathkey, pathkeys))
+			pathkeys = lappend(pathkeys, pathkey);
+	}
+
+	return pathkeys;
+}
+
 /****************************************************************************
  *		PATHKEYS AND MERGECLAUSES
  ****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
index b33bcd2f32..4bc16ea023 100644
--- a/src/backend/optimizer/path/uniquekeys.c
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -1129,3 +1129,66 @@ add_combined_uniquekey(RelOptInfo *joinrel,
 	}
 	return false;
 }
+
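+/*
+ * build_uniquekeys
+ *		Build a list of UniqueKeys representing the uniqueness requested at
+ *		the query level by the given sort/group clauses (e.g. distinctClause).
+ */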
+List *
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+	List *result = NIL;
+	List *sortkeys;
+	ListCell *l;
+	List *exprs = NIL;
+
+	sortkeys = make_pathkeys_for_uniquekeys(root,
+											sortclauses,
+											root->processed_tlist);
+
+	/* Collect the key exprs from the pathkeys, then build a single UniqueKey */
+	foreach(l, sortkeys)
+	{
+		PathKey    *pathkey = (PathKey *) lfirst(l);
+		EquivalenceClass *ec = pathkey->pk_eclass;
+		EquivalenceMember *mem = (EquivalenceMember *) lfirst(list_head(ec->ec_members));
+		if (EC_MUST_BE_REDUNDANT(ec))
+			continue;
+		exprs = lappend(exprs, mem->em_expr);
+	}
+
+	result = lappend(result, makeUniqueKey(exprs, false));
+
+	return result;
+}
+
+bool
+query_has_uniquekeys_for(PlannerInfo *root, List *pathuniquekeys,
+						 bool allow_multinulls)
+{
+	ListCell *lc;
+	ListCell *lc2;
+
+	/*
+	 * root->query_uniquekeys are the uniqueness requirements requested by the
+	 * DISTINCT clauses at the query level; pathuniquekeys are the unique keys
+	 * of the current path.  All requested query_uniquekeys must be satisfied
+	 * by the pathuniquekeys.
+	 */
+	foreach(lc, root->query_uniquekeys)
+	{
+		UniqueKey *query_ukey = lfirst_node(UniqueKey, lc);
+		bool satisfied = false;
+		foreach(lc2, pathuniquekeys)
+		{
+			UniqueKey *ukey = lfirst_node(UniqueKey, lc2);
+			if (ukey->multi_nullvals && !allow_multinulls)
+				continue;
+			if (list_length(ukey->exprs) == 0 &&
+				list_length(query_ukey->exprs) != 0)
+				continue;
+			if (list_is_subset(ukey->exprs, query_ukey->exprs))
+			{
+				satisfied = true;
+				break;
+			}
+		}
+		if (!satisfied)
+			return false;
+	}
+	return true;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0551ae0512..58386d5040 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3606,12 +3606,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 	 */
 	if (qp_extra->groupClause &&
 		grouping_is_sortable(qp_extra->groupClause))
+	{
 		root->group_pathkeys =
 			make_pathkeys_for_sortclauses(root,
 										  qp_extra->groupClause,
 										  tlist);
+		root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+	}
 	else
+	{
 		root->group_pathkeys = NIL;
+		root->query_uniquekeys = NIL;
+	}
 
 	/* We consider only the first (bottom) window in pathkeys logic */
 	if (activeWindows != NIL)
@@ -4815,13 +4821,19 @@ create_distinct_paths(PlannerInfo *root,
 			Path	   *path = (Path *) lfirst(lc);
 
 			if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
-			{
 				add_path(distinct_rel, (Path *)
 						 create_upper_unique_path(root, distinct_rel,
 												  path,
 												  list_length(root->distinct_pathkeys),
 												  numDistinctRows));
-			}
+		}
+
+		foreach(lc, input_rel->unique_pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+
+			if (query_has_uniquekeys_for(root, path->uniquekeys, false))
+				add_path(distinct_rel, path);
 		}
 
 		/* For explicit-sort case, always use the more rigorous clause */
@@ -7517,6 +7529,26 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		}
 	}
 
+	foreach(lc, rel->unique_pathlist)
+	{
+		Path	   *subpath = (Path *) lfirst(lc);
+
+		/* Shouldn't have any parameterized paths anymore */
+		Assert(subpath->param_info == NULL);
+
+		if (tlist_same_exprs)
+			subpath->pathtarget->sortgrouprefs =
+				scanjoin_target->sortgrouprefs;
+		else
+		{
+			Path	   *newpath;
+
+			newpath = (Path *) create_projection_path(root, rel, subpath,
+													  scanjoin_target);
+			lfirst(lc) = newpath;
+		}
+	}
+
 	/*
 	 * Now, if final scan/join target contains SRFs, insert ProjectSetPath(s)
 	 * atop each existing path.  (Note that this function doesn't look at the
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5110a6b806..d4abb3cb47 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -416,10 +416,10 @@ set_cheapest(RelOptInfo *parent_rel)
  * 'parent_rel' is the relation entry to which the path corresponds.
  * 'new_path' is a potential path for parent_rel.
  *
- * Returns nothing, but modifies parent_rel->pathlist.
+ * Returns the modified pathlist.
  */
-void
-add_path(RelOptInfo *parent_rel, Path *new_path)
+static List *
+add_path_to(RelOptInfo *parent_rel, List *pathlist, Path *new_path)
 {
 	bool		accept_new = true;	/* unless we find a superior old path */
 	int			insert_at = 0;	/* where to insert new item */
@@ -440,7 +440,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	 * for more than one old path to be tossed out because new_path dominates
 	 * it.
 	 */
-	foreach(p1, parent_rel->pathlist)
+	foreach(p1, pathlist)
 	{
 		Path	   *old_path = (Path *) lfirst(p1);
 		bool		remove_old = false; /* unless new proves superior */
@@ -584,8 +584,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 		 */
 		if (remove_old)
 		{
-			parent_rel->pathlist = foreach_delete_current(parent_rel->pathlist,
-														  p1);
+			pathlist = foreach_delete_current(pathlist, p1);
 
 			/*
 			 * Delete the data pointed-to by the deleted cell, if possible
@@ -612,8 +611,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	if (accept_new)
 	{
 		/* Accept the new path: insert it at proper place in pathlist */
-		parent_rel->pathlist =
-			list_insert_nth(parent_rel->pathlist, insert_at, new_path);
+		pathlist = list_insert_nth(pathlist, insert_at, new_path);
 	}
 	else
 	{
@@ -621,6 +619,23 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 		if (!IsA(new_path, IndexPath))
 			pfree(new_path);
 	}
+
+	return pathlist;
+}
+
+void
+add_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	parent_rel->pathlist = add_path_to(parent_rel,
+									   parent_rel->pathlist, new_path);
+}
+
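+/*
+ * add_unique_path
+ *	  Like add_path, but maintains parent_rel->unique_pathlist instead of
+ *	  parent_rel->pathlist.
+ */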
+void
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	parent_rel->unique_pathlist = add_path_to(parent_rel,
+											  parent_rel->unique_pathlist,
+											  new_path);
 }
 
 /*
@@ -2571,6 +2586,7 @@ create_projection_path(PlannerInfo *root,
 	pathnode->path.pathkeys = subpath->pathkeys;
 
 	pathnode->subpath = subpath;
+	pathnode->path.uniquekeys = subpath->uniquekeys;
 
 	/*
 	 * We might not need a separate Result node.  If the input plan node type
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 02e4458bef..a5c406bd4e 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -297,6 +297,7 @@ struct PlannerInfo
 
 	List	   *query_pathkeys; /* desired pathkeys for query_planner() */
 
+	List	   *query_uniquekeys; /* unique keys required for the query */
 	List	   *group_pathkeys; /* groupClause pathkeys, if any */
 	List	   *window_pathkeys;	/* pathkeys of bottom window, if any */
 	List	   *distinct_pathkeys;	/* distinctClause pathkeys, if any */
@@ -679,6 +680,7 @@ typedef struct RelOptInfo
 	List	   *pathlist;		/* Path structures */
 	List	   *ppilist;		/* ParamPathInfos used in pathlist */
 	List	   *partial_pathlist;	/* partial Paths */
+	List	   *unique_pathlist;	/* unique Paths */
 	struct Path *cheapest_startup_path;
 	struct Path *cheapest_total_path;
 	struct Path *cheapest_unique_path;
@@ -866,6 +868,7 @@ struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanskip;		/* can AM skip duplicate values? */
 	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
@@ -1182,6 +1185,8 @@ typedef struct Path
 
 	List	   *pathkeys;		/* sort ordering of path's output */
 	/* pathkeys is a List of PathKey nodes; see above */
+
+	List	   *uniquekeys;	/* the unique keys, or NIL if none */
 } Path;
 
 /* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 715a24ad29..6796ad8cb7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -27,6 +27,7 @@ extern int	compare_fractional_path_costs(Path *path1, Path *path2,
 										  double fraction);
 extern void set_cheapest(RelOptInfo *parent_rel);
 extern void add_path(RelOptInfo *parent_rel, Path *new_path);
+extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_path_precheck(RelOptInfo *parent_rel,
 							  Cost startup_cost, Cost total_cost,
 							  List *pathkeys, Relids required_outer);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9217a8d6c6..0cb8030e33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -215,6 +215,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
 extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
 										   List *sortclauses,
 										   List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+										  List *sortclauses,
+										  List *tlist);
 extern void initialize_mergeclause_eclasses(PlannerInfo *root,
 											RestrictInfo *restrictinfo);
 extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -282,6 +285,11 @@ extern bool relation_has_uniquekeys_for(PlannerInfo *root,
 										RelOptInfo *rel,
 										List *exprs,
 										bool allow_multinulls);
+extern bool query_has_uniquekeys_for(PlannerInfo *root,
+									 List *pathuniquekeys,
+									 bool allow_multinulls);
 extern bool relation_is_onerow(RelOptInfo *rel);
 
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+
 #endif							/* PATHS_H */
-- 
2.27.0

v01-0002-Index-skip-scan.patchapplication/octet-stream; name=v01-0002-Index-skip-scan.patchDownload
From 00e5f16a642c9715d3d8d568e0b15058b5a9e814 Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH 4/5] Index skip scan

Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
as part of the IndexOnlyScan, IndexScan and BitmapIndexScan for nbtree.
This patch significantly improves the performance of two main types of queries:
- SELECT DISTINCT, SELECT DISTINCT ON
- Regular SELECTs with WHERE-clauses on non-leading index attributes
For example, given an nbtree index on three columns (a,b,c), the following queries
may now be significantly faster:
- SELECT DISTINCT ON (a) * FROM t1
- SELECT * FROM t1 WHERE b=2
- SELECT * FROM t1 WHERE b IN (10,40)
- SELECT DISTINCT ON (a,b) * FROM t1 WHERE c BETWEEN 10 AND 100 ORDER BY a,b,c

The original patch and design were proposed by Thomas Munro [2], then revived
and improved by Dmitry Dolgov and Jesper Pedersen. Floris van Nee further
enhanced the functionality with a more general and more performant skip
implementation.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Author: Floris van Nee, Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
 contrib/amcheck/verify_nbtree.c               |    4 +-
 contrib/bloom/blutils.c                       |    3 +
 doc/src/sgml/config.sgml                      |   15 +
 doc/src/sgml/indexam.sgml                     |  121 +-
 doc/src/sgml/indices.sgml                     |   28 +
 src/backend/access/brin/brin.c                |    3 +
 src/backend/access/gin/ginutil.c              |    3 +
 src/backend/access/gist/gist.c                |    3 +
 src/backend/access/hash/hash.c                |    3 +
 src/backend/access/index/indexam.c            |  163 ++
 src/backend/access/nbtree/Makefile            |    1 +
 src/backend/access/nbtree/nbtinsert.c         |    2 +-
 src/backend/access/nbtree/nbtpage.c           |    2 +-
 src/backend/access/nbtree/nbtree.c            |   56 +-
 src/backend/access/nbtree/nbtsearch.c         |  788 ++++------
 src/backend/access/nbtree/nbtskip.c           | 1317 +++++++++++++++++
 src/backend/access/nbtree/nbtsort.c           |    2 +-
 src/backend/access/nbtree/nbtutils.c          |  821 +++++++++-
 src/backend/access/spgist/spgutils.c          |    3 +
 src/backend/commands/explain.c                |   29 +
 src/backend/executor/execScan.c               |   35 +-
 src/backend/executor/nodeBitmapIndexscan.c    |   21 +-
 src/backend/executor/nodeIndexonlyscan.c      |   69 +-
 src/backend/executor/nodeIndexscan.c          |   71 +-
 src/backend/nodes/copyfuncs.c                 |    5 +
 src/backend/nodes/outfuncs.c                  |    6 +
 src/backend/nodes/readfuncs.c                 |    5 +
 src/backend/optimizer/path/costsize.c         |    1 +
 src/backend/optimizer/path/indxpath.c         |   68 +
 src/backend/optimizer/path/pathkeys.c         |   72 +
 src/backend/optimizer/plan/createplan.c       |   38 +-
 src/backend/optimizer/plan/planner.c          |    8 +-
 src/backend/optimizer/util/pathnode.c         |   38 +
 src/backend/optimizer/util/plancat.c          |    3 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |    4 +-
 src/include/access/amapi.h                    |   19 +
 src/include/access/genam.h                    |   16 +
 src/include/access/nbtree.h                   |  140 +-
 src/include/executor/executor.h               |    4 +
 src/include/nodes/execnodes.h                 |    7 +
 src/include/nodes/pathnodes.h                 |    5 +
 src/include/nodes/plannodes.h                 |    5 +
 src/include/optimizer/cost.h                  |    1 +
 src/include/optimizer/pathnode.h              |    4 +
 src/include/optimizer/paths.h                 |    4 +
 src/interfaces/libpq/encnames.c               |    1 +
 src/interfaces/libpq/wchar.c                  |    1 +
 src/test/regress/expected/select_distinct.out |  599 ++++++++
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/sql/select_distinct.sql      |  248 ++++
 52 files changed, 4334 insertions(+), 544 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtskip.c
 create mode 120000 src/interfaces/libpq/encnames.c
 create mode 120000 src/interfaces/libpq/wchar.c

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e4d501a85d..73d7b63b3b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2508,7 +2508,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 	Buffer		lbuf;
 	bool		exists;
 
-	key = _bt_mkscankey(state->rel, itup);
+	key = _bt_mkscankey(state->rel, itup, NULL);
 	Assert(key->heapkeyspace && key->scantid != NULL);
 
 	/*
@@ -2944,7 +2944,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
 {
 	BTScanInsert skey;
 
-	skey = _bt_mkscankey(rel, itup);
+	skey = _bt_mkscankey(rel, itup, NULL);
 	skey->pivotsearch = true;
 
 	return skey;
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d3bf8665df..6170e06289 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -134,6 +134,9 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b353c61683..aa89e96fc0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4620,6 +4620,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+      <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of index-skip-scan plan
+        types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+        <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index af87f172a7..4ec3605b8f 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -150,6 +150,9 @@ typedef struct IndexAmRoutine
     amendscan_function amendscan;
     ammarkpos_function ammarkpos;       /* can be NULL */
     amrestrpos_function amrestrpos;     /* can be NULL */
+    amskip_function amskip;                        /* can be NULL */
+    ambeginscan_skip_function ambeginskipscan;     /* can be NULL */
+    amgettuple_with_skip_function amgetskiptuple;  /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -676,6 +679,122 @@ amrestrpos (IndexScanDesc scan);
    struct may be set to NULL.
   </para>
 
+  <para>
+<programlisting>
+bool
+amskip (IndexScanDesc scan,
+        ScanDirection prefixDir,
+        ScanDirection postfixDir);
+</programlisting>
+  Skip past all tuples whose first <literal>prefix</literal> columns have the
+  same values as the last tuple returned in the current scan. The arguments are:
+
+   <variablelist>
+    <varlistentry>
+     <term><parameter>scan</parameter></term>
+     <listitem>
+      <para>
+       Index scan information
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><parameter>prefixDir</parameter></term>
+     <listitem>
+      <para>
+       The direction in which the prefix part of the tuple is advancing.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><parameter>postfixDir</parameter></term>
+     <listitem>
+      <para>
+        The direction in which the postfix (everything after the prefix) of the tuple is advancing.
+      </para>
+     </listitem>
+    </varlistentry>
+   </variablelist>
+
+  </para>
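+  <para>
+   For example, assume <literal>prefix = 1</literal> and both directions
+   forward. If the index contains the tuples <literal>(1,a) (1,b) (1,c)
+   (2,a) (2,b)</literal> and the last tuple returned was <literal>(1,a)</literal>,
+   a successful <function>amskip</function> call positions the scan past
+   <literal>(1,c)</literal>, so that it continues with <literal>(2,a)</literal>.
+  </para>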
+  <para>
+<programlisting>
+IndexScanDesc
+ambeginscan_skip (Relation indexRelation,
+                  int nkeys,
+                  int norderbys,
+                  int prefix);
+</programlisting>
+   Prepare for an index scan.  The <literal>nkeys</literal> and <literal>norderbys</literal>
+   parameters indicate the number of quals and ordering operators that will be
+   used in the scan; these may be useful for space allocation purposes.
+   Note that the actual values of the scan keys aren't provided yet.
+   The result must be a palloc'd struct.
+   For implementation reasons the index access method
+   <emphasis>must</emphasis> create this struct by calling
+   <function>RelationGetIndexScan()</function>.  In most cases
+   <function>ambeginscan_skip</function> does little beyond making that call and perhaps
+   acquiring locks;
+   the interesting parts of index-scan startup are in <function>amrescan</function>.
+   If this is a skip scan, <literal>prefix</literal> must indicate the length
+   of the key prefix that can be skipped over. It can be set to -1 to disable
+   skipping, which yields a scan identical to one started with a regular
+   <function>ambeginscan</function> call.
+  </para>
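+  <para>
+   For example, a scan that should be able to skip over groups sharing the
+   first index column would be started with <literal>prefix = 1</literal>,
+   while <literal>prefix = -1</literal> requests a plain, non-skipping scan.
+  </para>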
+  <para>
+<programlisting>
+boolean
+amgettuple_skip (IndexScanDesc scan,
+                 ScanDirection prefixDir,
+                 ScanDirection postfixDir);
+</programlisting>
+     Fetch the next tuple in the given scan, moving in the given
+     directions: <literal>prefixDir</literal> is the direction in which the
+     key prefix (whose length was given in the <function>ambeginscan_skip</function>
+     call) advances, and <literal>postfixDir</literal> is the direction for
+     the remaining columns. The two directions can differ in DISTINCT scans
+     when fetching backwards from a cursor.
+     Returns true if a tuple was
+     obtained, false if no matching tuples remain.  In the true case the tuple
+     TID is stored into the <literal>scan</literal> structure.  Note that
+     <quote>success</quote> means only that the index contains an entry that matches
+     the scan keys, not that the tuple necessarily still exists in the heap or
+     will pass the caller's snapshot test.  On success, <function>amgettuple_skip</function>
+     must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
+     False means it is certain that the index entry matches the scan keys.
+     true means this is not certain, and the conditions represented by the
+     scan keys must be rechecked against the heap tuple after fetching it.
+     This provision supports <quote>lossy</quote> index operators.
+     Note that rechecking will extend only to the scan conditions; a partial
+     index predicate (if any) is never rechecked by <function>amgettuple</function>
+     callers.
+    </para>
+
+    <para>
+     If the index supports <link linkend="indexes-index-only-scans">index-only
+     scans</link> (i.e., <function>amcanreturn</function> returns true for it),
+     then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
+     and if that is true it must return the originally indexed data for the
+     index entry.  The data can be returned in the form of an
+     <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
+     with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
+     a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
+     with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>.  (The latter
+     format should be used when reconstructing data that might possibly not fit
+     into an <structname>IndexTuple</structname>.)  In either case,
+     management of the data referenced by the pointer is the access method's
+     responsibility.  The data must remain good at least until the next
+     <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
+     call for the scan.
+    </para>
+
+    <para>
+     The <function>amgettuple</function> function need only be provided if the access
+     method supports <quote>plain</quote> index scans.  If it doesn't, the
+     <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
+     struct must be set to NULL.
+    </para>
+
   <para>
    In addition to supporting ordinary index scans, some types of index
    may wish to support <firstterm>parallel index scans</firstterm>, which allow
@@ -691,7 +810,7 @@ amrestrpos (IndexScanDesc scan);
    functions may be implemented to support parallel index scans:
   </para>
 
-  <para>
+    <para>
 <programlisting>
 Size
 amestimateparallelscan (void);
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 28adaba72d..1eed07adfc 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1281,6 +1281,34 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
    and later will recognize such cases and allow index-only scans to be
    generated, but older versions will not.
   </para>
+
+  <sect2 id="indexes-index-skip-scans">
+    <title>Index Skip Scans</title>
+
+    <indexterm zone="indexes-index-skip-scans">
+      <primary>index</primary>
+      <secondary>index-skip scans</secondary>
+    </indexterm>
+    <indexterm zone="indexes-index-skip-scans">
+      <primary>index-skip scan</primary>
+    </indexterm>
+
+    <para>
+     When the rows retrieved from an index scan are then deduplicated by
+     eliminating rows matching on a prefix of index keys (e.g. when using
+     <literal>SELECT DISTINCT</literal>), the planner will consider
+     skipping groups of rows with a matching key prefix. When a row with
+     a particular prefix is found, remaining rows with the same key prefix
+     are skipped.  The more rows share the same key prefix (i.e. the fewer
+     distinct key prefixes the index contains), the more efficient this is.
+    </para>
+    <para>
+     Additionally, a skip scan can be considered in regular <literal>SELECT</literal>
+     queries. When filtering on a non-leading attribute of an index, the planner
+     may choose a skip scan.
+    </para>
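+    <para>
+     For example, given an index on <literal>(a, b, c)</literal>, queries
+     such as the following may be able to use a skip scan:
+<programlisting>
+SELECT DISTINCT ON (a) * FROM t1;
+SELECT * FROM t1 WHERE b = 2;
+</programlisting>
+    </para>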
+  </sect2>
  </sect1>
 
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5ee0..742fa3f634 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -115,6 +115,9 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = brincostestimate;
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1fedb..9ffd0fb173 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -66,6 +66,9 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = gincostestimate;
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 79fe6eb8d6..36dc5a6c2c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -87,6 +87,9 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = gistcostestimate;
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3ec6d528e7..1bedd55b06 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -84,6 +84,9 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = hashcostestimate;
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 6b9750c244..277db46f8a 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -14,7 +14,9 @@
  *		index_open		- open an index relation by relation OID
  *		index_close		- close an index relation
  *		index_beginscan - start a scan of an index with amgettuple
+ *		index_beginscan_skip - start a scan of an index with amgettuple and skipping
  *		index_beginscan_bitmap - start a scan of an index with amgetbitmap
+ *		index_beginscan_bitmap_skip - start a skip scan of an index with amgetbitmap
  *		index_rescan	- restart a scan of an index
  *		index_endscan	- end a scan
  *		index_insert	- insert an index tuple into a relation
@@ -25,14 +27,17 @@
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
  *		index_getnext_tid	- get the next TID from a scan
+ *		index_getnext_tid_skip	- get the next TID from a skip scan
  *		index_fetch_heap		- get the scan's next heap tuple
  *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_slot_skip	- get the next tuple from a skip scan
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
  *		index_can_return	- does index support index-only scans?
  *		index_getprocid - get a support procedure OID
  *		index_getprocinfo - get a support procedure's lookup info
+ *		index_skip		- advance past duplicate key values in a scan
  *
  * NOTES
  *		This file contains the index_ routines which used
@@ -222,6 +227,78 @@ index_beginscan(Relation heapRelation,
 	return scan;
 }
 
+static IndexScanDesc
+index_beginscan_internal_skip(Relation indexRelation,
+						 int nkeys, int norderbys, int prefix, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap)
+{
+	IndexScanDesc scan;
+
+	RELATION_CHECKS;
+	CHECK_REL_PROCEDURE(ambeginskipscan);
+
+	if (!(indexRelation->rd_indam->ampredlocks))
+		PredicateLockRelation(indexRelation, snapshot);
+
+	/*
+	 * We hold a reference count to the relcache entry throughout the scan.
+	 */
+	RelationIncrementReferenceCount(indexRelation);
+
+	/*
+	 * Tell the AM to open a scan.
+	 */
+	scan = indexRelation->rd_indam->ambeginskipscan(indexRelation, nkeys,
+												norderbys, prefix);
+	/* Initialize information for parallel scan. */
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
+
+	return scan;
+}
+
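+/*
+ * index_beginscan_skip - start a scan of an index with amgettuple and skipping
+ *
+ * Like index_beginscan, but 'prefix' gives the length of the key prefix the
+ * scan may skip over; -1 disables skipping.
+ */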
+IndexScanDesc
+index_beginscan_skip(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, int prefix)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal_skip(indexRelation, nkeys, norderbys, prefix, snapshot, NULL, false);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+
+	/* prepare to fetch index matches from table */
+	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+
+	return scan;
+}
+
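+/*
+ * index_beginscan_bitmap_skip - start a skip scan of an index with amgetbitmap
+ */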
+IndexScanDesc
+index_beginscan_bitmap_skip(Relation indexRelation,
+					   Snapshot snapshot,
+					   int nkeys,
+					   int prefix)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal_skip(indexRelation, nkeys, 0, prefix, snapshot, NULL, false);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan.
+	 */
+	scan->xs_snapshot = snapshot;
+
+	return scan;
+}
+
 /*
  * index_beginscan_bitmap - start a scan of an index with amgetbitmap
  *
@@ -550,6 +627,45 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
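+/* ----------------
+ *		index_getnext_tid_skip - get the next TID from a skip scan
+ *
+ *		Like index_getnext_tid, but the key prefix and postfix may advance
+ *		in different directions.
+ * ----------------
+ */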
+ItemPointer
+index_getnext_tid_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	bool		found;
+
+	SCAN_CHECKS;
+	CHECK_SCAN_PROCEDURE(amgetskiptuple);
+
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * The AM's amgetskiptuple proc finds the next index entry matching the scan
+	 * keys, and puts the TID into scan->xs_heaptid.  It should also set
+	 * scan->xs_recheck and possibly scan->xs_itup/scan->xs_hitup, though we
+	 * pay no attention to those fields here.
+	 */
+	found = scan->indexRelation->rd_indam->amgetskiptuple(scan, prefixDir, postfixDir);
+
+	/* Reset kill flag immediately for safety */
+	scan->kill_prior_tuple = false;
+	scan->xs_heap_continue = false;
+
+	/* If we're out of index entries, we're done */
+	if (!found)
+	{
+		/* release resources (like buffer pins) from table accesses */
+		if (scan->xs_heapfetch)
+			table_index_fetch_reset(scan->xs_heapfetch);
+
+		return NULL;
+	}
+	Assert(ItemPointerIsValid(&scan->xs_heaptid));
+
+	pgstat_count_index_tuples(scan->indexRelation, 1);
+
+	/* Return the TID of the tuple we found. */
+	return &scan->xs_heaptid;
+}
+
 /* ----------------
  *		index_fetch_heap - get the scan's next heap tuple
  *
@@ -641,6 +757,38 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
 	return false;
 }
 
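+/* ----------------
+ *		index_getnext_slot_skip - get the next tuple from a skip scan
+ *
+ *		Like index_getnext_slot, but fetches TIDs via index_getnext_tid_skip,
+ *		so the key prefix and postfix may advance in different directions.
+ * ----------------
+ */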
+bool
+index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir, TupleTableSlot *slot)
+{
+	for (;;)
+	{
+		if (!scan->xs_heap_continue)
+		{
+			ItemPointer tid;
+
+			/* Time to fetch the next TID from the index */
+			tid = index_getnext_tid_skip(scan, prefixDir, postfixDir);
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+
+			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+		}
+
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+		if (index_fetch_heap(scan, slot))
+			return true;
+	}
+
+	return false;
+}
+
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
@@ -736,6 +884,21 @@ index_can_return(Relation indexRelation, int attno)
 	return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
 }
 
+/* ----------------
+ *		index_skip
+ *
+ *		Skip past all tuples where the first 'prefix' columns have the
+ *		same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	SCAN_CHECKS;
+
+	return scan->indexRelation->rd_indam->amskip(scan, prefixDir, postfixDir);
+}
+
 /* ----------------
  *		index_getprocid
  *
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index d69808e78c..da96ac00a6 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -19,6 +19,7 @@ OBJS = \
 	nbtpage.o \
 	nbtree.o \
 	nbtsearch.o \
+	nbtskip.o \
 	nbtsort.o \
 	nbtsplitloc.o \
 	nbtutils.o \
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b86c122763..d020d921c4 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -89,7 +89,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_key = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup, NULL);
 
 	if (checkingunique)
 	{
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 75628e0eb9..9794e65a0a 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1594,7 +1594,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_key = _bt_mkscankey(rel, targetkey);
+				itup_key = _bt_mkscankey(rel, targetkey, NULL);
 				/* find the leftmost leaf page with matching pivot/high key */
 				itup_key->pivotsearch = true;
 				stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e947addef6..92283881e2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -136,14 +136,17 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
+	amroutine->amskip = btskip;
 	amroutine->amcostestimate = btcostestimate;
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->ambuildphasename = btbuildphasename;
 	amroutine->amvalidate = btvalidate;
 	amroutine->ambeginscan = btbeginscan;
+	amroutine->ambeginskipscan = btbeginscan_skip;
 	amroutine->amrescan = btrescan;
 	amroutine->amgettuple = btgettuple;
+	amroutine->amgetskiptuple = btgettuple_skip;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
@@ -219,6 +222,15 @@ btinsert(Relation rel, Datum *values, bool *isnull,
  */
 bool
 btgettuple(IndexScanDesc scan, ScanDirection dir)
+{
+	return btgettuple_skip(scan, dir, dir);
+}
+
+/*
+ *	btgettuple_skip() -- Get the next tuple in the scan.
+ */
+bool
+btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	bool		res;
@@ -237,7 +249,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		if (so->numArrayKeys < 0)
 			return false;
 
-		_bt_start_array_keys(scan, dir);
+		_bt_start_array_keys(scan, prefixDir);
 	}
 
 	/* This loop handles advancing to the next array elements, if any */
@@ -249,7 +261,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		 * _bt_first() to get the first item in the scan.
 		 */
 		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+			res = _bt_first(scan, prefixDir, postfixDir);
 		else
 		{
 			/*
@@ -276,14 +288,14 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 			/*
 			 * Now continue the scan.
 			 */
-			res = _bt_next(scan, dir);
+			res = _bt_next(scan, prefixDir, postfixDir);
 		}
 
 		/* If we have a tuple, return it ... */
 		if (res)
 			break;
 		/* ... otherwise see if we have more array keys to deal with */
-	} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+	} while (so->numArrayKeys && _bt_advance_array_keys(scan, prefixDir));
 
 	return res;
 }
@@ -314,7 +326,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	do
 	{
 		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		if (_bt_first(scan, ForwardScanDirection, ForwardScanDirection))
 		{
 			/* Save tuple ID, and continue scanning */
 			heapTid = &scan->xs_heaptid;
@@ -330,7 +342,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 				if (++so->currPos.itemIndex > so->currPos.lastItem)
 				{
 					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					if (!_bt_next(scan, ForwardScanDirection, ForwardScanDirection))
 						break;
 				}
 
@@ -351,6 +363,16 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
  */
 IndexScanDesc
 btbeginscan(Relation rel, int nkeys, int norderbys)
+{
+	return btbeginscan_skip(rel, nkeys, norderbys, -1);
+}
+
+
+ *	btbeginscan() -- start a scan on a btree index
+ */
+IndexScanDesc
+btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix)
 {
 	IndexScanDesc scan;
 	BTScanOpaque so;
@@ -385,10 +407,18 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	 */
 	so->currTuples = so->markTuples = NULL;
 
+	so->skipData = NULL;
+
 	scan->xs_itupdesc = RelationGetDescr(rel);
 
 	scan->opaque = so;
 
+	if (skipPrefix > 0)
+	{
+		so->skipData = (BTSkip) palloc0(sizeof(BTSkipData));
+		so->skipData->prefix = skipPrefix;
+	}
+
 	return scan;
 }
 
@@ -452,6 +482,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	_bt_preprocess_array_keys(scan);
 }
 
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	return _bt_skip(scan, prefixDir, postfixDir);
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -485,6 +524,8 @@ btendscan(IndexScanDesc scan)
 	if (so->currTuples != NULL)
 		pfree(so->currTuples);
 	/* so->markTuples should not be pfree'd, see btrescan */
+	if (_bt_skip_enabled(so))
+		pfree(so->skipData);
 	pfree(so);
 }
 
@@ -568,6 +609,9 @@ btrestrpos(IndexScanDesc scan)
 			if (so->currTuples)
 				memcpy(so->currTuples, so->markTuples,
 					   so->markPos.nextTupleOffset);
+			if (so->skipData)
+				memcpy(&so->skipData->curPos, &so->skipData->markPos,
+					   sizeof(BTSkipPosData));
 		}
 		else
 			BTScanPosInvalidate(so->currPos);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index f228c87a2b..da4b177e80 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -17,19 +17,17 @@
 
 #include "access/nbtree.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/predicate.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
 
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
 static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -38,14 +36,12 @@ static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
 static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
 									   OffsetNumber offnum,
 									   ItemPointer heapTid, int tupleOffset);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
 								  ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline bool _bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+										  ScanDirection dir, bool isRegularMode,
+										  bool *continuescan, int *prefixskipindex);
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -61,7 +57,7 @@ static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
  * will remain in shared memory for as long as it takes to scan the index
  * buffer page.
  */
-static void
+void
 _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 {
 	LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
@@ -340,7 +336,7 @@ _bt_moveright(Relation rel,
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
  */
-static OffsetNumber
+OffsetNumber
 _bt_binsrch(Relation rel,
 			BTScanInsert key,
 			Buffer buf)
@@ -846,25 +842,23 @@ _bt_compare(Relation rel,
  * in locating the scan start position.
  */
 bool
-_bt_first(IndexScanDesc scan, ScanDirection dir)
+_bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Buffer		buf;
 	BTStack		stack;
 	OffsetNumber offnum;
-	StrategyNumber strat;
-	bool		nextkey;
 	bool		goback;
 	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
-	int			i;
 	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
 	BlockNumber blkno;
+	IndexTuple itup;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -901,184 +895,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 		else if (blkno != InvalidBlockNumber)
 		{
-			if (!_bt_parallel_readpage(scan, blkno, dir))
+			if (!_bt_parallel_readpage(scan, blkno, prefixDir))
 				return false;
 			goto readcomplete;
 		}
 	}
 
-	/*----------
-	 * Examine the scan keys to discover where we need to start the scan.
-	 *
-	 * We want to identify the keys that can be used as starting boundaries;
-	 * these are =, >, or >= keys for a forward scan or =, <, <= keys for
-	 * a backwards scan.  We can use keys for multiple attributes so long as
-	 * the prior attributes had only =, >= (resp. =, <=) keys.  Once we accept
-	 * a > or < boundary or find an attribute with no boundary (which can be
-	 * thought of as the same as "> -infinity"), we can't use keys for any
-	 * attributes to its right, because it would break our simplistic notion
-	 * of what initial positioning strategy to use.
-	 *
-	 * When the scan keys include cross-type operators, _bt_preprocess_keys
-	 * may not be able to eliminate redundant keys; in such cases we will
-	 * arbitrarily pick a usable one for each attribute.  This is correct
-	 * but possibly not optimal behavior.  (For example, with keys like
-	 * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
-	 * x=5 would be more efficient.)  Since the situation only arises given
-	 * a poorly-worded query plus an incomplete opfamily, live with it.
-	 *
-	 * When both equality and inequality keys appear for a single attribute
-	 * (again, only possible when cross-type operators appear), we *must*
-	 * select one of the equality keys for the starting point, because
-	 * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
-	 * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
-	 * start at x=4, we will fail and stop before reaching x=10.  If multiple
-	 * equality quals survive preprocessing, however, it doesn't matter which
-	 * one we use --- by definition, they are either redundant or
-	 * contradictory.
-	 *
-	 * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
-	 * If the index stores nulls at the end of the index we'll be starting
-	 * from, and we have no boundary key for the column (which means the key
-	 * we deduced NOT NULL from is an inequality key that constrains the other
-	 * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
-	 * use as a boundary key.  If we didn't do this, we might find ourselves
-	 * traversing a lot of null entries at the start of the scan.
-	 *
-	 * In this loop, row-comparison keys are treated the same as keys on their
-	 * first (leftmost) columns.  We'll add on lower-order columns of the row
-	 * comparison below, if possible.
-	 *
-	 * The selected scan keys (at most one per index column) are remembered by
-	 * storing their addresses into the local startKeys[] array.
-	 *----------
-	 */
-	strat_total = BTEqualStrategyNumber;
-	if (so->numberOfKeys > 0)
-	{
-		AttrNumber	curattr;
-		ScanKey		chosen;
-		ScanKey		impliesNN;
-		ScanKey		cur;
-
-		/*
-		 * chosen is the so-far-chosen key for the current attribute, if any.
-		 * We don't cast the decision in stone until we reach keys for the
-		 * next attribute.
-		 */
-		curattr = 1;
-		chosen = NULL;
-		/* Also remember any scankey that implies a NOT NULL constraint */
-		impliesNN = NULL;
-
-		/*
-		 * Loop iterates from 0 to numberOfKeys inclusive; we use the last
-		 * pass to handle after-last-key processing.  Actual exit from the
-		 * loop is at one of the "break" statements below.
-		 */
-		for (cur = so->keyData, i = 0;; cur++, i++)
-		{
-			if (i >= so->numberOfKeys || cur->sk_attno != curattr)
-			{
-				/*
-				 * Done looking at keys for curattr.  If we didn't find a
-				 * usable boundary key, see if we can deduce a NOT NULL key.
-				 */
-				if (chosen == NULL && impliesNN != NULL &&
-					((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
-					 ScanDirectionIsForward(dir) :
-					 ScanDirectionIsBackward(dir)))
-				{
-					/* Yes, so build the key in notnullkeys[keysCount] */
-					chosen = &notnullkeys[keysCount];
-					ScanKeyEntryInitialize(chosen,
-										   (SK_SEARCHNOTNULL | SK_ISNULL |
-											(impliesNN->sk_flags &
-											 (SK_BT_DESC | SK_BT_NULLS_FIRST))),
-										   curattr,
-										   ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
-											BTGreaterStrategyNumber :
-											BTLessStrategyNumber),
-										   InvalidOid,
-										   InvalidOid,
-										   InvalidOid,
-										   (Datum) 0);
-				}
-
-				/*
-				 * If we still didn't find a usable boundary key, quit; else
-				 * save the boundary key pointer in startKeys.
-				 */
-				if (chosen == NULL)
-					break;
-				startKeys[keysCount++] = chosen;
-
-				/*
-				 * Adjust strat_total, and quit if we have stored a > or <
-				 * key.
-				 */
-				strat = chosen->sk_strategy;
-				if (strat != BTEqualStrategyNumber)
-				{
-					strat_total = strat;
-					if (strat == BTGreaterStrategyNumber ||
-						strat == BTLessStrategyNumber)
-						break;
-				}
-
-				/*
-				 * Done if that was the last attribute, or if next key is not
-				 * in sequence (implying no boundary key is available for the
-				 * next attribute).
-				 */
-				if (i >= so->numberOfKeys ||
-					cur->sk_attno != curattr + 1)
-					break;
-
-				/*
-				 * Reset for next attr.
-				 */
-				curattr = cur->sk_attno;
-				chosen = NULL;
-				impliesNN = NULL;
-			}
-
-			/*
-			 * Can we use this key as a starting boundary for this attr?
-			 *
-			 * If not, does it imply a NOT NULL constraint?  (Because
-			 * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
-			 * *any* inequality key works for that; we need not test.)
-			 */
-			switch (cur->sk_strategy)
-			{
-				case BTLessStrategyNumber:
-				case BTLessEqualStrategyNumber:
-					if (chosen == NULL)
-					{
-						if (ScanDirectionIsBackward(dir))
-							chosen = cur;
-						else
-							impliesNN = cur;
-					}
-					break;
-				case BTEqualStrategyNumber:
-					/* override any non-equality choice */
-					chosen = cur;
-					break;
-				case BTGreaterEqualStrategyNumber:
-				case BTGreaterStrategyNumber:
-					if (chosen == NULL)
-					{
-						if (ScanDirectionIsForward(dir))
-							chosen = cur;
-						else
-							impliesNN = cur;
-					}
-					break;
-			}
-		}
-	}
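+	/*
+	 * Select the scan keys usable as starting boundaries.  This logic now
+	 * lives in _bt_choose_scan_keys so that the skip scan machinery can
+	 * reuse it to build its forward and backward insertion scan keys.
+	 */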
+	keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, prefixDir,
+									 startKeys, notnullkeys, &strat_total, 0);
 
 	/*
 	 * If we found no usable boundary keys, we have to start from one end of
@@ -1089,260 +912,112 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		bool		match;
 
-		match = _bt_endpoint(scan, dir);
-
-		if (!match)
+		if (!_bt_skip_enabled(so))
 		{
-			/* No match, so mark (parallel) scan finished */
-			_bt_parallel_done(scan);
-		}
+			match = _bt_endpoint(scan, prefixDir);
 
-		return match;
-	}
+			if (!match)
+			{
+				/* No match, so mark (parallel) scan finished */
+				_bt_parallel_done(scan);
+			}
 
-	/*
-	 * We want to start the scan somewhere within the index.  Set up an
-	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built using the keys
-	 * identified by startKeys[].  (Remaining insertion scankey fields are
-	 * initialized after initial-positioning strategy is finalized.)
-	 */
-	Assert(keysCount <= INDEX_MAX_KEYS);
-	for (i = 0; i < keysCount; i++)
-	{
-		ScanKey		cur = startKeys[i];
+			return match;
+		}
+		else
+		{
+			Relation	rel = scan->indexRelation;
+			Buffer		buf;
+			Page		page;
+			BTPageOpaque opaque;
+			OffsetNumber start;
+			BTSkipCompareResult cmp = {0};
 
-		Assert(cur->sk_attno == i + 1);
+			_bt_skip_create_scankeys(rel, so);
 
-		if (cur->sk_flags & SK_ROW_HEADER)
-		{
 			/*
-			 * Row comparison header: look to the first row member instead.
-			 *
-			 * The member scankeys are already in insertion format (ie, they
-			 * have sk_func = 3-way-comparison function), but we have to watch
-			 * out for nulls, which _bt_preprocess_keys didn't check. A null
-			 * in the first row member makes the condition unmatchable, just
-			 * like qual_ok = false.
+			 * Scan down to the leftmost or rightmost leaf page and position
+			 * the scan on the leftmost or rightmost item on that page.
+			 * Start the skip scan from there to find the first matching item.
 			 */
-			ScanKey		subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+			buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(prefixDir), scan->xs_snapshot);
 
-			Assert(subkey->sk_flags & SK_ROW_MEMBER);
-			if (subkey->sk_flags & SK_ISNULL)
+			if (!BufferIsValid(buf))
 			{
-				_bt_parallel_done(scan);
+				/*
+				 * Empty index. Lock the whole relation, as nothing finer to lock
+				 * exists.
+				 */
+				PredicateLockRelation(rel, scan->xs_snapshot);
+				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
-			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
-			/*
-			 * If the row comparison is the last positioning key we accepted,
-			 * try to add additional keys from the lower-order row members.
-			 * (If we accepted independent conditions on additional index
-			 * columns, we use those instead --- doesn't seem worth trying to
-			 * determine which is more restrictive.)  Note that this is OK
-			 * even if the row comparison is of ">" or "<" type, because the
-			 * condition applied to all but the last row member is effectively
-			 * ">=" or "<=", and so the extra keys don't break the positioning
-			 * scheme.  But, by the same token, if we aren't able to use all
-			 * the row members, then the part of the row comparison that we
-			 * did use has to be treated as just a ">=" or "<=" condition, and
-			 * so we'd better adjust strat_total accordingly.
-			 */
-			if (i == keysCount - 1)
+			PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
+			page = BufferGetPage(buf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			Assert(P_ISLEAF(opaque));
+
+			if (ScanDirectionIsForward(prefixDir))
 			{
-				bool		used_all_subkeys = false;
+				/* There could be dead pages to the left, so not this: */
+				/* Assert(P_LEFTMOST(opaque)); */
 
-				Assert(!(subkey->sk_flags & SK_ROW_END));
-				for (;;)
-				{
-					subkey++;
-					Assert(subkey->sk_flags & SK_ROW_MEMBER);
-					if (subkey->sk_attno != keysCount + 1)
-						break;	/* out-of-sequence, can't use it */
-					if (subkey->sk_strategy != cur->sk_strategy)
-						break;	/* wrong direction, can't use it */
-					if (subkey->sk_flags & SK_ISNULL)
-						break;	/* can't use null keys */
-					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(inskey.scankeys + keysCount, subkey,
-						   sizeof(ScanKeyData));
-					keysCount++;
-					if (subkey->sk_flags & SK_ROW_END)
-					{
-						used_all_subkeys = true;
-						break;
-					}
-				}
-				if (!used_all_subkeys)
-				{
-					switch (strat_total)
-					{
-						case BTLessStrategyNumber:
-							strat_total = BTLessEqualStrategyNumber;
-							break;
-						case BTGreaterStrategyNumber:
-							strat_total = BTGreaterEqualStrategyNumber;
-							break;
-					}
-				}
-				break;			/* done with outer loop */
+				start = P_FIRSTDATAKEY(opaque);
 			}
-		}
-		else
-		{
-			/*
-			 * Ordinary comparison key.  Transform the search-style scan key
-			 * to an insertion scan key by replacing the sk_func with the
-			 * appropriate btree comparison function.
-			 *
-			 * If scankey operator is not a cross-type comparison, we can use
-			 * the cached comparison function; otherwise gotta look it up in
-			 * the catalogs.  (That can't lead to infinite recursion, since no
-			 * indexscan initiated by syscache lookup will use cross-data-type
-			 * operators.)
-			 *
-			 * We support the convention that sk_subtype == InvalidOid means
-			 * the opclass input type; this is a hack to simplify life for
-			 * ScanKeyInit().
-			 */
-			if (cur->sk_subtype == rel->rd_opcintype[i] ||
-				cur->sk_subtype == InvalidOid)
+			else if (ScanDirectionIsBackward(prefixDir))
 			{
-				FmgrInfo   *procinfo;
-
-				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
-											   cur->sk_flags,
-											   cur->sk_attno,
-											   InvalidStrategy,
-											   cur->sk_subtype,
-											   cur->sk_collation,
-											   procinfo,
-											   cur->sk_argument);
+				Assert(P_RIGHTMOST(opaque));
+
+				start = PageGetMaxOffsetNumber(page);
 			}
 			else
 			{
-				RegProcedure cmp_proc;
-
-				cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
-											 rel->rd_opcintype[i],
-											 cur->sk_subtype,
-											 BTORDER_PROC);
-				if (!RegProcedureIsValid(cmp_proc))
-					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
-						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
-						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(inskey.scankeys + i,
-									   cur->sk_flags,
-									   cur->sk_attno,
-									   InvalidStrategy,
-									   cur->sk_subtype,
-									   cur->sk_collation,
-									   cmp_proc,
-									   cur->sk_argument);
+				elog(ERROR, "invalid scan direction: %d", (int) prefixDir);
 			}
-		}
-	}
 
-	/*----------
-	 * Examine the selected initial-positioning strategy to determine exactly
-	 * where we need to start the scan, and set flag variables to control the
-	 * code below.
-	 *
-	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
-	 * item >= scan key.  If nextkey = true, they will locate the first
-	 * item > scan key.
-	 *
-	 * If goback = true, we will then step back one item, while if
-	 * goback = false, we will start the scan on the located item.
-	 *----------
-	 */
-	switch (strat_total)
-	{
-		case BTLessStrategyNumber:
-
-			/*
-			 * Find first item >= scankey, then back up one to arrive at last
-			 * item < scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = false;
-			goback = true;
-			break;
-
-		case BTLessEqualStrategyNumber:
-
-			/*
-			 * Find first item > scankey, then back up one to arrive at last
-			 * item <= scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = true;
-			goback = true;
-			break;
-
-		case BTEqualStrategyNumber:
-
-			/*
-			 * If a backward scan was specified, need to start with last equal
-			 * item not first one.
+			/* remember which buffer we have pinned */
+			so->currPos.buf = buf;
+			so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+			itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, start));
+			/*
+			 * In some cases we can (or must) skip further within the prefix.
+			 * We can do so when extra quals become available, e.g. WHERE b=2
+			 * on an index on (a,b).  We must do so when this is not regular
+			 * mode (prefixDir != postfixDir), because that means we're at
+			 * the end of the prefix while we should be at its beginning.
 			 */
-			if (ScanDirectionIsBackward(dir))
+			if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, 0) ||
+					!_bt_skip_is_regular_mode(prefixDir, postfixDir))
 			{
-				/*
-				 * This is the same as the <= strategy.  We will check at the
-				 * end whether the found item is actually =.
-				 */
-				nextkey = true;
-				goback = true;
+				_bt_skip_extra_conditions(scan, &itup, &start, prefixDir, postfixDir, &cmp);
 			}
-			else
+			/* now find the next matching tuple */
+			match = _bt_skip_find_next(scan, itup, start, prefixDir, postfixDir);
+			if (!match)
 			{
-				/*
-				 * This is the same as the >= strategy.  We will check at the
-				 * end whether the found item is actually =.
-				 */
-				nextkey = false;
-				goback = false;
+				if (_bt_skip_is_always_valid(so))
+					_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+				return false;
 			}
-			break;
 
-		case BTGreaterEqualStrategyNumber:
+			_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
-			/*
-			 * Find first item >= scankey.  (This is only used for forward
-			 * scans.)
-			 */
-			nextkey = false;
-			goback = false;
-			break;
-
-		case BTGreaterStrategyNumber:
-
-			/*
-			 * Find first item > scankey.  (This is only used for forward
-			 * scans.)
-			 */
-			nextkey = true;
-			goback = false;
-			break;
+			currItem = &so->currPos.items[so->currPos.itemIndex];
+			scan->xs_heaptid = currItem->heapTid;
+			if (scan->xs_want_itup)
+				scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
 
-		default:
-			/* can't get here, but keep compiler quiet */
-			elog(ERROR, "unrecognized strat_total: %d", (int) strat_total);
-			return false;
+			return true;
+		}
 	}
 
-	/* Initialize remaining insertion scan key fields */
-	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
-	inskey.anynullkeys = false; /* unused */
-	inskey.nextkey = nextkey;
-	inskey.pivotsearch = false;
-	inskey.scantid = NULL;
-	inskey.keysz = keysCount;
+	if (!_bt_create_insertion_scan_key(rel, prefixDir, startKeys, keysCount,
+									   &inskey, &strat_total, &goback))
+	{
+		_bt_parallel_done(scan);
+		return false;
+	}
 
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
@@ -1374,7 +1049,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	_bt_initialize_more_data(so, dir);
+	_bt_initialize_more_data(so, prefixDir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, &inskey, buf);
@@ -1404,23 +1079,79 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	Assert(!BTScanPosIsValid(so->currPos));
 	so->currPos.buf = buf;
 
-	/*
-	 * Now load data from the first page of the scan.
-	 */
-	if (!_bt_readpage(scan, dir, offnum))
+	if (_bt_skip_enabled(so))
 	{
-		/*
-		 * There's no actually-matching data on this page.  Try to advance to
-		 * the next page.  Return false if there's no matching data at all.
+		Page page;
+		BTPageOpaque opaque;
+		OffsetNumber minoff;
+		bool match;
+		BTSkipCompareResult cmp = {0};
+
+		/* first create the skip scan keys */
+		_bt_skip_create_scankeys(rel, so);
+
+		/* remember which page we have pinned */
+		so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+		page = BufferGetPage(so->currPos.buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		minoff = P_FIRSTDATAKEY(opaque);
+		/*
+		 * _bt_binsrch plus the goback parameter can leave offnum before the
+		 * first item on the page or after the last item on the page.  If
+		 * that is the case, we need to step back or forward one page.
 		 */
-		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-		if (!_bt_steppage(scan, dir))
+		if (offnum < minoff)
+		{
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			if (!_bt_step_back_page(scan, &itup, &offnum))
+				return false;
+		}
+		else if (offnum > PageGetMaxOffsetNumber(page))
+		{
+			BlockNumber next = opaque->btpo_next;
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			if (!_bt_step_forward_page(scan, next, &itup, &offnum))
+				return false;
+		}
+
+		itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+		/* check if we can skip even more because we can use new conditions */
+		if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, inskey.keysz) ||
+				!_bt_skip_is_regular_mode(prefixDir, postfixDir))
+		{
+			_bt_skip_extra_conditions(scan, &itup, &offnum, prefixDir, postfixDir, &cmp);
+		}
+		/* now find the tuple */
+		match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+		if (!match)
+		{
+			if (_bt_skip_is_always_valid(so))
+				_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 			return false;
+		}
+
+		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 	else
 	{
-		/* Drop the lock, and maybe the pin, on the current page */
-		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+		/*
+		 * Now load data from the first page of the scan.
+		 */
+		if (!_bt_readpage(scan, prefixDir, &offnum, true))
+		{
+			/*
+			 * There's no actually-matching data on this page.  Try to advance to
+			 * the next page.  Return false if there's no matching data at all.
+			 */
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			if (!_bt_steppage(scan, prefixDir))
+				return false;
+		}
+		else
+		{
+			/* Drop the lock, and maybe the pin, on the current page */
+			_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+		}
 	}
 
 readcomplete:
@@ -1448,29 +1179,113 @@ readcomplete:
  *		so->currPos.buf to InvalidBuffer.
  */
 bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
+_bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	BTScanPosItem *currItem;
 
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
+	if (!_bt_skip_enabled(so))
 	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		/*
+		 * Advance to next tuple on current page; or if there's no more, try to
+		 * step to the next page with data.
+		 */
+		if (ScanDirectionIsForward(prefixDir))
 		{
-			if (!_bt_steppage(scan, dir))
-				return false;
+			if (++so->currPos.itemIndex > so->currPos.lastItem)
+			{
+				if (!_bt_steppage(scan, prefixDir))
+					return false;
+			}
+		}
+		else
+		{
+			if (--so->currPos.itemIndex < so->currPos.firstItem)
+			{
+				if (!_bt_steppage(scan, prefixDir))
+					return false;
+			}
 		}
 	}
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		bool match;
+		IndexTuple itup = NULL;
+		OffsetNumber offnum = InvalidOffsetNumber;
+
+		if (ScanDirectionIsForward(postfixDir))
 		{
-			if (!_bt_steppage(scan, dir))
-				return false;
+			if (++so->currPos.itemIndex > so->currPos.lastItem)
+			{
+				if (prefixDir != so->skipData->curPos.nextDirection)
+				{
+					/*
+					 * This happens when doing a cursor scan and changing
+					 * direction in the meantime, e.g. first fetching forwards
+					 * and then backwards.  We *always* just go to the next
+					 * page instead of skipping, because that's the only safe
+					 * option.
+					 */
+					so->skipData->curPos.nextAction = SkipStateNext;
+					so->skipData->curPos.nextDirection = prefixDir;
+				}
+
+				if (so->skipData->curPos.nextAction == SkipStateNext)
+				{
+					/* we should just go forwards one page, no skipping is necessary */
+					if (!_bt_step_forward_page(scan, so->currPos.nextPage, &itup, &offnum))
+						return false;
+				}
+				else if (so->skipData->curPos.nextAction == SkipStateStop)
+				{
+					/* we've reached the end of the index, or we cannot find any more keys */
+					BTScanPosUnpinIfPinned(so->currPos);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+
+				/* now find the next tuple */
+				match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+				if (!match)
+				{
+					if (_bt_skip_is_always_valid(so))
+						_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+					return false;
+				}
+				_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+			}
+		}
+		else
+		{
+			if (--so->currPos.itemIndex < so->currPos.firstItem)
+			{
+				if (prefixDir != so->skipData->curPos.nextDirection)
+				{
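+					/* Direction changed; see the comments in the forward case above */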
+					so->skipData->curPos.nextAction = SkipStateNext;
+					so->skipData->curPos.nextDirection = prefixDir;
+				}
+
+				if (so->skipData->curPos.nextAction == SkipStateNext)
+				{
+					if (!_bt_step_back_page(scan, &itup, &offnum))
+						return false;
+				}
+				else if (so->skipData->curPos.nextAction == SkipStateStop)
+				{
+					BTScanPosUnpinIfPinned(so->currPos);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+
+				/* now find the next tuple */
+				match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+				if (!match)
+				{
+					if (_bt_skip_is_always_valid(so))
+						_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+					return false;
+				}
+				_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+			}
 		}
 	}
 
@@ -1502,8 +1317,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  *
  * Returns true if any matching items found on the page, false if none.
  */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber *offnum, bool isRegularMode)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
@@ -1513,6 +1328,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	int			prefixskipindex;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1571,11 +1387,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		/* load items[] in ascending order */
 		itemIndex = 0;
 
-		offnum = Max(offnum, minoff);
+		*offnum = Max(*offnum, minoff);
 
-		while (offnum <= maxoff)
+		while (*offnum <= maxoff)
 		{
-			ItemId		iid = PageGetItemId(page, offnum);
+			ItemId		iid = PageGetItemId(page, *offnum);
 			IndexTuple	itup;
 
 			/*
@@ -1584,19 +1400,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			 */
 			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 			{
-				offnum = OffsetNumberNext(offnum);
+				*offnum = OffsetNumberNext(*offnum);
 				continue;
 			}
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+			if (_bt_checkkeys_extended(scan, itup, indnatts, dir, isRegularMode, &continuescan, &prefixskipindex))
 			{
 				/* tuple passes all scan key conditions */
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(so, itemIndex, *offnum, itup);
 					itemIndex++;
 				}
 				else
@@ -1608,26 +1424,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 					 * TID
 					 */
 					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+						_bt_setuppostingitems(so, itemIndex, *offnum,
 											  BTreeTupleGetPostingN(itup, 0),
 											  itup);
 					itemIndex++;
 					/* Remember additional TIDs */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(so, itemIndex, *offnum,
 											BTreeTupleGetPostingN(itup, i),
 											tupleOffset);
 						itemIndex++;
 					}
 				}
 			}
+
+			*offnum = OffsetNumberNext(*offnum);
+
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
 				break;
-
-			offnum = OffsetNumberNext(offnum);
+			if (!isRegularMode && prefixskipindex != -1)
+				break;
 		}
+		*offnum = OffsetNumberPrev(*offnum);
 
 		/*
 		 * We don't need to visit page to the right when the high key
@@ -1647,7 +1467,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			int			truncatt;
 
 			truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
-			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan, NULL);
 		}
 
 		if (!continuescan)
@@ -1663,11 +1483,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		/* load items[] in descending order */
 		itemIndex = MaxTIDsPerBTreePage;
 
-		offnum = Min(offnum, maxoff);
+		*offnum = Min(*offnum, maxoff);
 
-		while (offnum >= minoff)
+		while (*offnum >= minoff)
 		{
-			ItemId		iid = PageGetItemId(page, offnum);
+			ItemId		iid = PageGetItemId(page, *offnum);
 			IndexTuple	itup;
 			bool		tuple_alive;
 			bool		passes_quals;
@@ -1684,10 +1504,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			 */
 			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 			{
-				Assert(offnum >= P_FIRSTDATAKEY(opaque));
-				if (offnum > P_FIRSTDATAKEY(opaque))
+				Assert(*offnum >= P_FIRSTDATAKEY(opaque));
+				if (*offnum > P_FIRSTDATAKEY(opaque))
 				{
-					offnum = OffsetNumberPrev(offnum);
+					*offnum = OffsetNumberPrev(*offnum);
 					continue;
 				}
 
@@ -1698,8 +1518,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
-										 &continuescan);
+			passes_quals = _bt_checkkeys_extended(scan, itup, indnatts, dir,
+												  isRegularMode, &continuescan, &prefixskipindex);
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions */
@@ -1707,7 +1527,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(so, itemIndex, *offnum, itup);
 				}
 				else
 				{
@@ -1725,28 +1545,32 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 					 */
 					itemIndex--;
 					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+						_bt_setuppostingitems(so, itemIndex, *offnum,
 											  BTreeTupleGetPostingN(itup, 0),
 											  itup);
 					/* Remember additional TIDs */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(so, itemIndex, *offnum,
 											BTreeTupleGetPostingN(itup, i),
 											tupleOffset);
 					}
 				}
 			}
+
+			*offnum = OffsetNumberPrev(*offnum);
+
 			if (!continuescan)
 			{
 				/* there can't be any more matches, so stop */
 				so->currPos.moreLeft = false;
 				break;
 			}
-
-			offnum = OffsetNumberPrev(offnum);
+			if (!isRegularMode && prefixskipindex != -1)
+				break;
 		}
+		*offnum = OffsetNumberNext(*offnum);
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
@@ -1854,7 +1678,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
  * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
  * to InvalidBuffer.  We return true to indicate success.
  */
-static bool
+bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1882,6 +1706,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		if (so->markTuples)
 			memcpy(so->markTuples, so->currTuples,
 				   so->currPos.nextTupleOffset);
+		if (so->skipData)
+			memcpy(&so->skipData->markPos, &so->skipData->curPos,
+				   sizeof(BTSkipPosData));
 		so->markPos.itemIndex = so->markItemIndex;
 		so->markItemIndex = -1;
 	}
@@ -1961,13 +1788,14 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * If there are no more matching records in the given direction, we drop all
  * locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
  */
-static bool
+bool
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Relation	rel;
 	Page		page;
 	BTPageOpaque opaque;
+	OffsetNumber offnum;
 	bool		status = true;
 
 	rel = scan->indexRelation;
@@ -1999,7 +1827,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
 				/* see if there are any matches on this page */
 				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+				offnum = P_FIRSTDATAKEY(opaque);
+				if (_bt_readpage(scan, dir, &offnum, true))
 					break;
 			}
 			else if (scan->parallel_scan != NULL)
@@ -2101,7 +1930,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 				PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
 				/* see if there are any matches on this page */
 				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+				offnum = PageGetMaxOffsetNumber(page);
+				if (_bt_readpage(scan, dir, &offnum, true))
 					break;
 			}
 			else if (scan->parallel_scan != NULL)
@@ -2169,7 +1999,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
  * to be half-dead; the caller should check that condition and step left
  * again if it's important.
  */
-static Buffer
+Buffer
 _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
 {
 	Page		page;
@@ -2433,7 +2263,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readpage(scan, dir, start))
+	if (!_bt_readpage(scan, dir, &start, true))
 	{
 		/*
 		 * There's no actually-matching data on this page.  Try to advance to
@@ -2462,7 +2292,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
  * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
  * for scan direction
  */
-static inline void
+inline void
 _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
 {
 	/* initialize moreLeft/moreRight appropriately for scan direction */
@@ -2479,3 +2309,25 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
 	so->numKilled = 0;			/* just paranoia */
 	so->markItemIndex = -1;		/* ditto */
 }
+
+/*
+ * Forward the call to either _bt_checkkeys, which is the simplest and
+ * fastest way of checking keys, or to _bt_checkkeys_skip, which is slower
+ * but returns extra information about whether we should stop reading the
+ * current page and skip.  The expensive checking is only necessary when
+ * !isRegularMode, i.e. when prefixDir != postfixDir, which only happens
+ * when scanning backwards from a cursor.
+ */
+static inline bool
+_bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+					   ScanDirection dir, bool isRegularMode,
+					   bool *continuescan, int *prefixskipindex)
+{
+	if (isRegularMode)
+	{
+		return _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+	}
+	else
+	{
+		return _bt_checkkeys_skip(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+	}
+}
diff --git a/src/backend/access/nbtree/nbtskip.c b/src/backend/access/nbtree/nbtskip.c
new file mode 100644
index 0000000000..7850230b9f
--- /dev/null
+++ b/src/backend/access/nbtree/nbtskip.c
@@ -0,0 +1,1317 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtskip.c
+ *	  Search code related to skip scan for postgres btrees.
+ *
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtskip.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "catalog/catalog.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+#include "storage/predicate.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+
+static inline void _bt_update_scankey_with_tuple(BTScanInsert scankeys,
+											Relation indexRel, IndexTuple itup, int numattrs);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, Buffer buf);
+static inline int32 _bt_compare_until(Relation rel, BTScanInsert key, IndexTuple itup, int prefix);
+static inline void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum,
+						  OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+									 ScanDirection postfixDir, int skipped, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction);
+static inline void _bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs);
+static inline IndexTuple _bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum);
+static void _bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+											   ScanDirection prefixDir, ScanDirection postfixDir);
+static void _bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+										int prefix, IndexTuple itup, ScanDirection prefixDir);
+static bool _bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir);
+
+/*
+ * Returns whether the scan should continue, i.e. whether we're not yet at
+ * the end of the scan.  The scan position can be invalid even though we
+ * should still continue the scan.  This happens, for example, when we're
+ * scanning with prefixDir != postfixDir: when looking at the first prefix,
+ * we traverse the items within the prefix from max to min.  If none of
+ * them match, we actually run off the start of the index, meaning none of
+ * the tuples within this prefix match.  The scan pos becomes invalid;
+ * however, we do need to look further at the next prefix.  Therefore,
+ * this function still returns true in this particular case.
+ */
+static inline bool
+_bt_skip_is_valid(BTScanOpaque so, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	return BTScanPosIsValid(so->currPos) ||
+			(!_bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+			 so->skipData->curPos.nextAction != SkipStateStop);
+}
+
+/*
+ * Try finding the next tuple to skip to within the local tuple storage.
+ * The local tuple storage is filled during _bt_readpage with all matching
+ * tuples on that page; if we can find the next prefix there, it saves us
+ * a scan from the root.
+ * Note that this optimization only works in regular mode (prefixDir ==
+ * postfixDir).  If that is not the case, the local tuple workspace will
+ * only ever contain tuples of one specific prefix (_bt_readpage stops at
+ * the end of a prefix).
+ */
+static bool
+_bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTScanPosItem *currItem;
+	BTSkip skip = so->skipData;
+	IndexTuple itup = NULL;
+	bool goback;
+	int low, high, starthigh, startlow;
+	int32		result,
+				cmpval;
+	BTScanInsert key = &so->skipData->curPos.skipScanKey;
+
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+	_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation, skip->prefix, itup, prefixDir);
+
+	_bt_set_bsearch_flags(key->scankeys[key->keysz - 1].sk_strategy, prefixDir, &key->nextkey, &goback);
+
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+
+	startlow = low = ScanDirectionIsForward(prefixDir) ? so->currPos.itemIndex + 1 : so->currPos.firstItem;
+	starthigh = high = ScanDirectionIsForward(prefixDir) ? so->currPos.lastItem : so->currPos.itemIndex - 1;
+
+	/*
+	 * If there are no items within the range to search, there is nothing
+	 * on this page to skip to, so tell the caller to do a full skip
+	 * instead.
+	 */
+	if (unlikely(high < low))
+		return false;
+
+	/*
+	 * Binary search to find the first key on the page >= scan key, or first
+	 * key > scankey when nextkey is true.
+	 *
+	 * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+	 * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+	 *
+	 * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+	 * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+	 *
+	 * We can fall out when high == low.
+	 */
+	high++;						/* establish the loop invariant for high */
+
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+
+	while (high > low)
+	{
+		int mid = low + ((high - low) / 2);
+
+		/* We have low <= mid < high, so mid points at a real slot */
+
+		currItem = &so->currPos.items[mid];
+		itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+		result = _bt_compare_until(scan->indexRelation, key, itup, skip->prefix);
+
+		if (result >= cmpval)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	if (high > starthigh)
+		return false;
+
+	if (goback)
+	{
+		low--;
+		if (low < startlow)
+			return false;
+	}
+
+	so->currPos.itemIndex = low;
+
+	return true;
+}
+
+/*
+ *  _bt_skip() -- Skip items that have the same prefix as the most recently
+ * 				  fetched index tuple.
+ *
+ * in: pinned, not locked
+ * out: pinned, not locked (unless end of scan, then unpinned)
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTScanPosItem *currItem;
+	IndexTuple itup = NULL;
+	OffsetNumber curTupleOffnum = InvalidOffsetNumber;
+	BTSkipCompareResult cmp;
+	BTSkip skip = so->skipData;
+	OffsetNumber first;
+
+	/* in-page skip only works when prefixDir == postfixDir */
+	if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) || !_bt_try_in_page_skip(scan, prefixDir))
+	{
+		currItem = &so->currPos.items[so->currPos.itemIndex];
+		itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+		so->skipData->curPos.nextSkipIndex = so->skipData->prefix;
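+		/*
+		 * Skip once to get past the current prefix, then keep skipping
+		 * until we find a tuple that matches the scan keys, or until we
+		 * reach the end of the scan.
+		 */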
+		_bt_skip_once(scan, &itup, &curTupleOffnum, true, prefixDir, postfixDir);
+		_bt_skip_until_match(scan, &itup, &curTupleOffnum, prefixDir, postfixDir);
+		if (!_bt_skip_is_always_valid(so))
+			return false;
+
+		first = curTupleOffnum;
+		_bt_readpage(scan, postfixDir, &curTupleOffnum, _bt_skip_is_regular_mode(prefixDir, postfixDir));
+		if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+		{
+			print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+						"first item on page compared after skip");
+			print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+						"last item on page compared after skip");
+		}
+		_bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+								 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+								 postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+		_bt_determine_next_action(scan, &cmp, first, curTupleOffnum, postfixDir, &skip->curPos.nextAction);
+		skip->curPos.nextDirection = prefixDir;
+		skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+		_bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum), prefixDir, postfixDir);
+
+		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+	}
+
+	/*
+	 * Prepare for the call to _bt_next, because _bt_next steps the item
+	 * index to get to the tuple we want to be at.
+	 */
+	if (ScanDirectionIsForward(postfixDir))
+		so->currPos.itemIndex--;
+	else
+		so->currPos.itemIndex++;
+
+	return true;
+}
+
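+/* Fetch the index tuple at the given offset on the currently pinned page */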
+static IndexTuple
+_bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+	Page page = BufferGetPage(so->currPos.buf);
+	return (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+}
+
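+/*
+ * Decide what the scan should do next after reading a page: stop the scan
+ * (SkipStateStop), skip to the next prefix (SkipStateSkip), skip using the
+ * extra quals (SkipStateSkipExtra), or just continue on to the next page
+ * (SkipStateNext).
+ */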
+static void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum, OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	if (cmp->fullKeySkip)
+		*nextAction = SkipStateStop;
+	else if (ScanDirectionIsForward(postfixDir))
+	{
+		OffsetNumber firstItem = firstOffnum, lastItem = lastOffnum;
+		if (cmp->prefixSkip)
+		{
+			*nextAction = SkipStateSkip;
+		}
+		else
+		{
+			IndexTuple toCmp;
+			if (so->currPos.lastItem >= so->currPos.firstItem)
+				toCmp = _bt_get_tuple_from_offset(so, so->currPos.items[so->currPos.lastItem].indexOffset);
+			else
+				toCmp = _bt_get_tuple_from_offset(so, firstItem);
+			_bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+										  scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+			if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+					(cmp->prefixCmpResult != 0 ||
+					 _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+									   _bt_get_tuple_from_offset(so, lastItem), so->skipData->prefix) != 0))
+				*nextAction = SkipStateSkipExtra;
+			else
+				*nextAction = SkipStateNext;
+		}
+	}
+	else
+	{
+		OffsetNumber firstItem = lastOffnum, lastItem = firstOffnum;
+		if (cmp->prefixSkip)
+		{
+			*nextAction = SkipStateSkip;
+		}
+		else
+		{
+			IndexTuple toCmp;
+			if (so->currPos.lastItem >= so->currPos.firstItem)
+				toCmp = _bt_get_tuple_from_offset(so, so->currPos.items[so->currPos.firstItem].indexOffset);
+			else
+				toCmp = _bt_get_tuple_from_offset(so, lastItem);
+			_bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+										  scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+			if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+					(cmp->prefixCmpResult != 0 ||
+					 _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+									   _bt_get_tuple_from_offset(so, firstItem), so->skipData->prefix) != 0))
+				*nextAction = SkipStateSkipExtra;
+			else
+				*nextAction = SkipStateNext;
+		}
+	}
+}
+
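+/*
+ * Returns whether the comparison result indicates that we should skip to
+ * the next prefix.
+ */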
+static inline bool
+_bt_should_prefix_skip(BTSkipCompareResult *cmp)
+{
+	return cmp->prefixSkip || cmp->prefixCmpResult != 0;
+}
+
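+/*
+ * As _bt_determine_next_action, but called right after a skip to decide
+ * whether to stop, continue reading, skip again, or apply the extra quals.
+ */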
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+									 ScanDirection postfixDir, int skipped, BTSkipState *nextAction)
+{
+	if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+		*nextAction = SkipStateStop;
+	else if (cmp->equal && _bt_skip_is_regular_mode(prefixDir, postfixDir))
+		*nextAction = SkipStateNext;
+	else if (_bt_should_prefix_skip(cmp) && _bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+			 ((ScanDirectionIsForward(prefixDir) && cmp->skCmpResult == -1) ||
+			  (ScanDirectionIsBackward(prefixDir) && cmp->skCmpResult == 1)))
+		*nextAction = SkipStateSkip;
+	else if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) ||
+			 _bt_has_extra_quals_after_skip(so->skipData, postfixDir, skipped) ||
+			 cmp->prefixCmpResult != 0)
+		*nextAction = SkipStateSkipExtra;
+	else
+		*nextAction = SkipStateNext;
+}
+
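+/*
+ * As _bt_determine_next_action_after_skip, but for use after a skip that
+ * already applied the extra quals.
+ */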
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction)
+{
+	if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+		*nextAction = SkipStateStop;
+	else if (cmp->equal)
+		*nextAction = SkipStateNext;
+	else if (_bt_should_prefix_skip(cmp))
+		*nextAction = SkipStateSkip;
+	else
+		*nextAction = SkipStateNext;
+}
+
+/*
+ * Just a debug function that prints a scankey; it will be removed for the
+ * final patch.
+ */
+static inline void
+_print_skey(IndexScanDesc scan, BTScanInsert scanKey)
+{
+	Oid			typOutput;
+	bool		varlenatype;
+	char	   *val;
+	int i;
+	Relation rel = scan->indexRelation;
+
+	for (i = 0; i < scanKey->keysz; i++)
+	{
+		ScanKey cur = &scanKey->scankeys[i];
+		if (!IsCatalogRelation(rel))
+		{
+			if (!(cur->sk_flags & SK_ISNULL))
+			{
+				if (cur->sk_subtype != InvalidOid)
+					getTypeOutputInfo(cur->sk_subtype,
+									  &typOutput, &varlenatype);
+				else
+					getTypeOutputInfo(rel->rd_opcintype[i],
+									  &typOutput, &varlenatype);
+				val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+				if (val)
+				{
+					elog(DEBUG1, "%s sk attr %d val: %s (%s, %s)",
+						 RelationGetRelationName(rel), i, val,
+						 (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+						 (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+					pfree(val);
+				}
+			}
+			else
+			{
+				elog(DEBUG1, "%s sk attr %d val: NULL (%s, %s)",
+					 RelationGetRelationName(rel), i,
+					 (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+					 (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+			}
+		}
+	}
+}
+
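+/*
+ * Like _bt_checkkeys, but additionally reports, via *prefixskipindex,
+ * when the tuple no longer belongs to the current prefix, which tells the
+ * caller to stop reading this page and skip instead.
+ */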
+bool
+_bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+				   ScanDirection dir, bool *continuescan, int *prefixskipindex)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+
+	bool match = _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+	int prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+	if (*prefixskipindex == -1 && prefixCmpResult != 0)
+	{
+		*prefixskipindex = skip->prefix;
+		return false;
+	}
+	else
+	{
+		bool newcont;
+		_bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &newcont, prefixskipindex);
+		if (*prefixskipindex == -1 && prefixCmpResult != 0)
+		{
+			*prefixskipindex = skip->prefix;
+			return false;
+		}
+	}
+	return match;
+}
+
+/*
+ * Compare a scankey with a given tuple, but only on the first 'prefix'
+ * columns.  This function returns 0 if the first 'prefix' columns are
+ * equal, -1 if key < itup on the first 'prefix' columns, and 1 if
+ * key > itup on the first 'prefix' columns.
+ */
+int32
+_bt_compare_until(Relation rel,
+			BTScanInsert key,
+			IndexTuple itup,
+			int prefix)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ScanKey		scankey;
+	int			ncmpkey;
+
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+
+	ncmpkey = Min(prefix, key->keysz);
+	scankey = key->scankeys;
+	for (int i = 1; i <= ncmpkey; i++)
+	{
+		Datum		datum;
+		bool		isNull;
+		int32		result;
+
+		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+		/* see comments about NULLs handling in btbuild */
+		if (scankey->sk_flags & SK_ISNULL)	/* key is NULL */
+		{
+			if (isNull)
+				result = 0;		/* NULL "=" NULL */
+			else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+				result = -1;	/* NULL "<" NOT_NULL */
+			else
+				result = 1;		/* NULL ">" NOT_NULL */
+		}
+		else if (isNull)		/* key is NOT_NULL and item is NULL */
+		{
+			if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+				result = 1;		/* NOT_NULL ">" NULL */
+			else
+				result = -1;	/* NOT_NULL "<" NULL */
+		}
+		else
+		{
+			/*
+			 * The sk_func needs to be passed the index value as left arg and
+			 * the sk_argument as right arg (they might be of different
+			 * types).  Since it is convenient for callers to think of
+			 * _bt_compare as comparing the scankey to the index item, we have
+			 * to flip the sign of the comparison result.  (Unless it's a DESC
+			 * column, in which case we *don't* flip the sign.)
+			 */
+			result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+													 scankey->sk_collation,
+													 datum,
+													 scankey->sk_argument));
+
+			if (!(scankey->sk_flags & SK_BT_DESC))
+				INVERT_COMPARE_RESULT(result);
+		}
+
+		/* if the keys are unequal, return the difference */
+		if (result != 0)
+			return result;
+
+		scankey++;
+	}
+	return 0;
+}
+
+
+/*
+ * Create initial scankeys for skipping and stores them in the skipData
+ * structure
+ */
+void
+_bt_skip_create_scankeys(Relation rel, BTScanOpaque so)
+{
+	int keysCount;
+	BTSkip skip = so->skipData;
+	StrategyNumber stratTotal;
+	ScanKey		keyPointers[INDEX_MAX_KEYS];
+	bool goback;
+
+	/*
+	 * We need to create both forward and backward keys, because the scan
+	 * direction may change at any moment in scans with a cursor.  We could
+	 * technically delay creation of the second set until first use as an
+	 * optimization, but that is not implemented yet.
+	 */
+	keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, ForwardScanDirection,
+									 keyPointers, skip->fwdNotNullKeys, &stratTotal, skip->prefix);
+	_bt_create_insertion_scan_key(rel, ForwardScanDirection, keyPointers, keysCount,
+								  &skip->fwdScanKey, &stratTotal, &goback);
+
+	keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, BackwardScanDirection,
+									 keyPointers, skip->bwdNotNullKeys, &stratTotal, skip->prefix);
+	_bt_create_insertion_scan_key(rel, BackwardScanDirection, keyPointers, keysCount,
+								  &skip->bwdScanKey, &stratTotal, &goback);
+
+	_bt_metaversion(rel, &skip->curPos.skipScanKey.heapkeyspace,
+					&skip->curPos.skipScanKey.allequalimage);
+	skip->curPos.skipScanKey.anynullkeys = false; /* unused */
+	skip->curPos.skipScanKey.nextkey = false;
+	skip->curPos.skipScanKey.pivotsearch = false;
+	skip->curPos.skipScanKey.scantid = NULL;
+	skip->curPos.skipScanKey.keysz = 0;
+
+	/*
+	 * Set up a scankey for the current tuple as well.  We won't necessarily
+	 * use the data from the current tuple right away, but we need the rest
+	 * of the data structure to be set up correctly for when we use it to
+	 * create the skip->curPos.skipScanKey keys later.
+	 */
+	_bt_mkscankey(rel, NULL, &skip->currentTupleKey);
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * 								within a page, specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+						Buffer buf)
+{
+	/*
+	 * @todo: an optimization is still possible here: only check either the
+	 * low or the high key, depending on which direction *we came from* AND
+	 * which direction *we are planning to scan*.
+	 */
+	OffsetNumber low, high;
+	Page page = BufferGetPage(buf);
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	int			ans_lo, ans_hi;
+
+	low = P_FIRSTDATAKEY(opaque);
+	high = PageGetMaxOffsetNumber(page);
+
+	if (unlikely(high < low))
+		return false;
+
+	ans_lo = _bt_compare(scan->indexRelation,
+					   key, page, low);
+	ans_hi = _bt_compare(scan->indexRelation,
+					   key, page, high);
+	if (key->nextkey)
+	{
+		/* sk < last && sk >= first */
+		return ans_lo >= 0 && ans_hi == -1;
+	}
+	else
+	{
+		/* sk <= last && sk > first */
+		return ans_lo == 1 && ans_hi <= 0;
+	}
+}
+
+/* in: pinned and locked, out: pinned and locked (unless end of scan) */
+static void
+_bt_skip_find(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+			  BTScanInsert scanKey, ScanDirection dir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	OffsetNumber offnum;
+	BTStack stack;
+	Buffer buf;
+	bool goback;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber minoff;
+	Relation rel = scan->indexRelation;
+	bool fromroot = true;
+
+	_bt_set_bsearch_flags(scanKey->scankeys[scanKey->keysz - 1].sk_strategy, dir, &scanKey->nextkey, &goback);
+
+	if ((DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages) && !IsCatalogRelation(rel))
+	{
+		if (*curTuple != NULL)
+			print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+						"before btree search");
+
+		elog(DEBUG1, "%s searching tree with %d keys, nextkey=%d, goback=%d",
+			 RelationGetRelationName(rel), scanKey->keysz, scanKey->nextkey,
+			 goback);
+
+		_print_skey(scan, scanKey);
+	}
+
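+	/*
+	 * If we still hold a position on a page, first check whether the scan
+	 * key falls within that page; if so, we can binary-search the page
+	 * directly and avoid a new descent from the root.
+	 */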
+	if (*curTupleOffnum == InvalidOffsetNumber)
+	{
+		BTScanPosUnpinIfPinned(so->currPos);
+	}
+	else
+	{
+		if (_bt_scankey_within_page(scan, scanKey, so->currPos.buf))
+		{
+			elog(DEBUG1, "sk found within current page");
+
+			offnum = _bt_binsrch(scan->indexRelation, scanKey, so->currPos.buf);
+			fromroot = false;
+		}
+		else
+		{
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			ReleaseBuffer(so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+	}
+
+	/*
+	 * We haven't found scan key within the current page, so let's scan from
+	 * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+	 * number
+	 */
+	if (fromroot)
+	{
+		stack = _bt_search(scan->indexRelation, scanKey,
+						   &buf, BT_READ, scan->xs_snapshot);
+		_bt_freestack(stack);
+		so->currPos.buf = buf;
+
+		offnum = _bt_binsrch(scan->indexRelation, scanKey, buf);
+
+		/* Lock the page for SERIALIZABLE transactions */
+		PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+						  scan->xs_snapshot);
+	}
+
+	page = BufferGetPage(so->currPos.buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (goback)
+	{
+		offnum = OffsetNumberPrev(offnum);
+		minoff = P_FIRSTDATAKEY(opaque);
+		if (offnum < minoff)
+		{
+			LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+			if (!_bt_step_back_page(scan, curTuple, curTupleOffnum))
+				return;
+			page = BufferGetPage(so->currPos.buf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			offnum = PageGetMaxOffsetNumber(page);
+		}
+	}
+	else if (offnum > PageGetMaxOffsetNumber(page))
+	{
+		BlockNumber next = opaque->btpo_next;
+		LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+		if (!_bt_step_forward_page(scan, next, curTuple, curTupleOffnum))
+			return;
+		page = BufferGetPage(so->currPos.buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		offnum = P_FIRSTDATAKEY(opaque);
+	}
+
+	/* We know in which direction to look */
+	_bt_initialize_more_data(so, dir);
+
+	*curTupleOffnum = offnum;
+	*curTuple = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+		print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+					"after btree search");
+}
+
+static inline bool
+_bt_step_one_page(IndexScanDesc scan, ScanDirection dir, IndexTuple *curTuple,
+				  OffsetNumber *curTupleOffnum)
+{
+	if (ScanDirectionIsForward(dir))
+	{
+		BTScanOpaque so = (BTScanOpaque) scan->opaque;
+		return _bt_step_forward_page(scan, so->currPos.nextPage, curTuple, curTupleOffnum);
+	}
+	else
+	{
+		return _bt_step_back_page(scan, curTuple, curTupleOffnum);
+	}
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+					  OffsetNumber *curTupleOffnum)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation rel = scan->indexRelation;
+	BlockNumber blkno = next;
+	Page page;
+	BTPageOpaque opaque;
+
+	Assert(BTScanPosIsValid(so->currPos));
+
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		_bt_killitems(scan);
+
+	/*
+	 * Before we modify currPos, make a copy of the page data if there was a
+	 * mark position that needs it.
+	 */
+	if (so->markItemIndex >= 0)
+	{
+		/* bump pin on current buffer for assignment to mark buffer */
+		if (BTScanPosIsPinned(so->currPos))
+			IncrBufferRefCount(so->currPos.buf);
+		memcpy(&so->markPos, &so->currPos,
+			   offsetof(BTScanPosData, items[1]) +
+			   so->currPos.lastItem * sizeof(BTScanPosItem));
+		if (so->markTuples)
+			memcpy(so->markTuples, so->currTuples,
+				   so->currPos.nextTupleOffset);
+		so->markPos.itemIndex = so->markItemIndex;
+		if (so->skipData)
+			memcpy(&so->skipData->markPos, &so->skipData->curPos,
+				   sizeof(BTSkipPosData));
+		so->markItemIndex = -1;
+	}
+
+	/* Remember we left a page with data */
+	so->currPos.moreLeft = true;
+
+	/* release the previous buffer, if pinned */
+	BTScanPosUnpinIfPinned(so->currPos);
+
+	{
+		for (;;)
+		{
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
+			if (blkno == P_NONE)
+			{
+				_bt_parallel_done(scan);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+
+			/* check for interrupts while we're not holding any buffer lock */
+			CHECK_FOR_INTERRUPTS();
+			/* step right one page */
+			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			page = BufferGetPage(so->currPos.buf);
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
+			if (!P_IGNORE(opaque))
+			{
+				PredicateLockPage(rel, blkno, scan->xs_snapshot);
+				*curTupleOffnum = P_FIRSTDATAKEY(opaque);
+				*curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+				break;
+			}
+
+			blkno = opaque->btpo_next;
+			_bt_relbuf(rel, so->currPos.buf);
+		}
+	}
+
+	return true;
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	Assert(BTScanPosIsValid(so->currPos));
+
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		_bt_killitems(scan);
+
+	/*
+	 * Before we modify currPos, make a copy of the page data if there was a
+	 * mark position that needs it.
+	 */
+	if (so->markItemIndex >= 0)
+	{
+		/* bump pin on current buffer for assignment to mark buffer */
+		if (BTScanPosIsPinned(so->currPos))
+			IncrBufferRefCount(so->currPos.buf);
+		memcpy(&so->markPos, &so->currPos,
+			   offsetof(BTScanPosData, items[1]) +
+			   so->currPos.lastItem * sizeof(BTScanPosItem));
+		if (so->markTuples)
+			memcpy(so->markTuples, so->currTuples,
+				   so->currPos.nextTupleOffset);
+		if (so->skipData)
+			memcpy(&so->skipData->markPos, &so->skipData->curPos,
+				   sizeof(BTSkipPosData));
+		so->markPos.itemIndex = so->markItemIndex;
+		so->markItemIndex = -1;
+	}
+
+	/* Remember we left a page with data */
+	so->currPos.moreRight = true;
+
+	/* Not parallel, so just use our own notion of the current page */
+
+	{
+		Relation	rel;
+		Page		page;
+		BTPageOpaque opaque;
+
+		rel = scan->indexRelation;
+
+		if (BTScanPosIsPinned(so->currPos))
+			LockBuffer(so->currPos.buf, BT_READ);
+		else
+			so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+
+		for (;;)
+		{
+			/* Step to next physical page */
+			so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+											scan->xs_snapshot);
+
+			/* if we're physically at end of index, return failure */
+			if (so->currPos.buf == InvalidBuffer)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+
+			/*
+			 * Okay, we managed to move left to a non-deleted page. Done if
+			 * it's not half-dead and contains matching tuples. Else loop back
+			 * and do it all again.
+			 */
+			page = BufferGetPage(so->currPos.buf);
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			if (!P_IGNORE(opaque))
+			{
+				PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+				*curTupleOffnum = PageGetMaxOffsetNumber(page);
+				*curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+				break;
+			}
+		}
+	}
+
+	return true;
+}
+
+/* holds lock as long as curTupleOffnum != InvalidOffsetNumber */
+bool
+_bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+				   ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	BTSkipCompareResult cmp;
+
+	while (_bt_skip_is_valid(so, prefixDir, postfixDir))
+	{
+		bool found;
+		_bt_skip_until_match(scan, &curTuple, &curTupleOffnum, prefixDir, postfixDir);
+
+		while (_bt_skip_is_always_valid(so))
+		{
+			OffsetNumber first = curTupleOffnum;
+			found = _bt_readpage(scan, postfixDir, &curTupleOffnum,
+								 _bt_skip_is_regular_mode(prefixDir, postfixDir));
+			if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+			{
+				print_itup(BufferGetBlockNumber(so->currPos.buf),
+						   _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+							"first item on page compared");
+				print_itup(BufferGetBlockNumber(so->currPos.buf),
+						   _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+							"last item on page compared");
+			}
+			_bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+									 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+									 postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+			_bt_determine_next_action(scan, &cmp, first, curTupleOffnum,
+									  postfixDir, &skip->curPos.nextAction);
+			skip->curPos.nextDirection = prefixDir;
+			skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+
+			if (found)
+			{
+				_bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+												   prefixDir, postfixDir);
+				return true;
+			}
+			else if (skip->curPos.nextAction == SkipStateNext)
+			{
+				if (curTupleOffnum != InvalidOffsetNumber)
+					LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+				if (!_bt_step_one_page(scan, postfixDir, &curTuple, &curTupleOffnum))
+					return false;
+			}
+			else if (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra)
+			{
+				curTuple = _bt_get_tuple_from_offset(so, curTupleOffnum);
+				_bt_skip_update_scankey_after_read(scan, curTuple, prefixDir, postfixDir);
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+				curTupleOffnum = InvalidOffsetNumber;
+				curTuple = NULL;
+				break;
+			}
+			else if (skip->curPos.nextAction == SkipStateStop)
+			{
+				LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+			else
+			{
+				Assert(false);
+			}
+		}
+	}
+	return false;
+}
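+/*
+ * Editor's sketch of the state machine driven above (assumed reading of
+ * the actions): after each page read, the comparison of the last tuple
+ * classifies the next step as SkipStateNext (keep reading the adjacent
+ * page), SkipStateSkip / SkipStateSkipExtra (re-search the tree for the
+ * next prefix, optionally refining within it using the extra quals), or
+ * SkipStateStop (no further matches are possible, so invalidate the scan
+ * position and end).  For example, SELECT * FROM tbl WHERE c = 1 on an
+ * index (a,b,c) with prefix size 2 alternates between Skip (jump to the
+ * next (a,b) prefix) and Next/Stop, depending on what the last comparison
+ * says.
+ */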
+
+void
+_bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+					 ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	while (_bt_skip_is_valid(so, prefixDir, postfixDir) &&
+		   (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra))
+	{
+		_bt_skip_once(scan, curTuple, curTupleOffnum,
+					  skip->curPos.nextAction == SkipStateSkip, prefixDir, postfixDir);
+	}
+}
+
+void
+_bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts, ScanDirection dir,
+						 bool isRegularMode, BTSkipCompareResult* cmp)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+
+	if (_bt_skip_is_always_valid(so))
+	{
+		bool continuescan = true;
+
+		cmp->equal = _bt_checkkeys(scan, tuple, tupnatts, dir, &continuescan, &cmp->prefixSkipIndex);
+		cmp->fullKeySkip = !continuescan;
+		/* the prefix can be smaller than the scankey due to extra quals
+		 * being added, so we need to compare both. @todo this can be optimized into one function call */
+		cmp->prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+		cmp->skCmpResult = _bt_compare_until(scan->indexRelation,
+											 &skip->curPos.skipScanKey, tuple, skip->curPos.skipScanKey.keysz);
+		if (cmp->prefixSkipIndex == -1)
+		{
+			cmp->prefixSkipIndex = skip->prefix;
+			cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+		}
+		else
+		{
+			int newskip = -1;
+			_bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &continuescan, &newskip);
+			if (newskip != -1)
+			{
+				cmp->prefixSkip = true;
+				cmp->prefixSkipIndex = newskip;
+			}
+			else
+			{
+				cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+				cmp->prefixSkipIndex = skip->prefix;
+			}
+		}
+
+		if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+		{
+			print_itup(BufferGetBlockNumber(so->currPos.buf), tuple, NULL, scan->indexRelation,
+						"compare item");
+			_print_skey(scan, &skip->curPos.skipScanKey);
+			elog(DEBUG1, "result: eq: %d fkskip: %d pfxskip: %d prefixcmpres: %d prefixskipidx: %d", cmp->equal, cmp->fullKeySkip,
+				 _bt_should_prefix_skip(cmp), cmp->prefixCmpResult, cmp->prefixSkipIndex);
+		}
+	}
+	else
+	{
+		/* if !isRegularMode we cannot stop the scan - instead we need to skip to the next prefix */
+		cmp->fullKeySkip = isRegularMode;
+		cmp->equal = false;
+		cmp->prefixCmpResult = -2;
+		cmp->prefixSkip = true;
+		cmp->prefixSkipIndex = skip->prefix;
+		cmp->skCmpResult = -2;
+	}
+}
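+/*
+ * Editor's note on the result fields (semantics assumed from the usage
+ * above): 'equal' is the plain _bt_checkkeys verdict for the tuple;
+ * 'fullKeySkip' means that no further tuples in the current direction can
+ * match, so the scan may stop; 'prefixSkip' and 'prefixSkipIndex' say
+ * that the current prefix is exhausted and how many leading attributes
+ * the next skip key should be built from.
+ */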
+
+void
+_bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+			  bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	BTSkipCompareResult cmp;
+	bool doskip = forceSkip;
+	int skipIndex = skip->curPos.nextSkipIndex;
+	skip->curPos.nextAction = SkipStateSkipExtra;
+
+	while (doskip)
+	{
+		int toskip = skipIndex;
+		if (*curTuple != NULL)
+		{
+			if (skip->prefix <= skipIndex || !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+			{
+				toskip = skip->prefix;
+			}
+
+			_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+													toskip, *curTuple, prefixDir);
+		}
+
+		_bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, prefixDir);
+
+		if (_bt_skip_is_always_valid(so))
+		{
+			_bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+												   prefixDir, prefixDir, true, *curTuple);
+			_bt_compare_current_item(scan, *curTuple,
+									 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+									 prefixDir,
+									 _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+			skipIndex = cmp.prefixSkipIndex;
+			_bt_determine_next_action_after_skip(so, &cmp, prefixDir,
+												 postfixDir, toskip, &skip->curPos.nextAction);
+		}
+		else
+		{
+			skip->curPos.nextAction = SkipStateStop;
+		}
+		doskip = skip->curPos.nextAction == SkipStateSkip;
+	}
+	if (skip->curPos.nextAction != SkipStateStop && skip->curPos.nextAction != SkipStateNext)
+		_bt_skip_extra_conditions(scan, curTuple, curTupleOffnum, prefixDir, postfixDir, &cmp);
+}
+
+void
+_bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+						  ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	bool regularMode = _bt_skip_is_regular_mode(prefixDir, postfixDir);
+	if (_bt_skip_is_always_valid(so))
+	{
+		do
+		{
+			if (*curTuple != NULL)
+				_bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+													   postfixDir, prefixDir, false, *curTuple);
+			_bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, postfixDir);
+			_bt_compare_current_item(scan, *curTuple,
+									 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+									 postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), cmp);
+		} while (regularMode && cmp->prefixCmpResult != 0 && !cmp->equal && !cmp->fullKeySkip);
+		skip->curPos.nextSkipIndex = cmp->prefixSkipIndex;
+	}
+	_bt_determine_next_action_after_skip_extra(so, cmp, &skip->curPos.nextAction);
+}
+
+static void
+_bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+								   ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	if (skip->curPos.nextAction == SkipStateSkip)
+	{
+		int toskip = skip->curPos.nextSkipIndex;
+		if (skip->prefix <= skip->curPos.nextSkipIndex ||
+				!_bt_skip_is_regular_mode(prefixDir, postfixDir))
+		{
+			toskip = skip->prefix;
+		}
+
+		if (_bt_skip_is_regular_mode(prefixDir, postfixDir))
+			_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+													toskip, curTuple, prefixDir);
+		else
+			_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+													toskip, NULL, prefixDir);
+	}
+	else if (skip->curPos.nextAction == SkipStateSkipExtra)
+	{
+		_bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+											   postfixDir, prefixDir, false, curTuple);
+	}
+}
+
+static inline int
+_bt_compare_one(ScanKey scankey, Datum datum2, bool isNull2)
+{
+	int32		result;
+	Datum datum1 = scankey->sk_argument;
+	bool isNull1 = scankey->sk_flags & SK_ISNULL;
+	/* see comments about NULLs handling in btbuild */
+	if (isNull1)	/* key is NULL */
+	{
+		if (isNull2)
+			result = 0;		/* NULL "=" NULL */
+		else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+			result = -1;	/* NULL "<" NOT_NULL */
+		else
+			result = 1;		/* NULL ">" NOT_NULL */
+	}
+	else if (isNull2)		/* key is NOT_NULL and item is NULL */
+	{
+		if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+			result = 1;		/* NOT_NULL ">" NULL */
+		else
+			result = -1;	/* NOT_NULL "<" NULL */
+	}
+	else
+	{
+		/*
+		 * The sk_func needs to be passed the index value as left arg and
+		 * the sk_argument as right arg (they might be of different
+		 * types).  Since it is convenient for callers to think of
+		 * _bt_compare as comparing the scankey to the index item, we have
+		 * to flip the sign of the comparison result.  (Unless it's a DESC
+		 * column, in which case we *don't* flip the sign.)
+		 */
+		result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+												 scankey->sk_collation,
+												 datum2,
+												 datum1));
+
+		if (!(scankey->sk_flags & SK_BT_DESC))
+			INVERT_COMPARE_RESULT(result);
+	}
+	return result;
+}
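+/*
+ * Editor's sketch of _bt_compare_one()'s conventions, assuming an ASC
+ * NULLS LAST int4 column: comparing key 5 against item value 7 yields a
+ * negative result (the key sorts before the item), against item 5 it
+ * yields 0, and against a NULL item it yields -1, since non-NULLs sort
+ * before NULLs here.  With SK_BT_NULLS_FIRST or SK_BT_DESC set, the
+ * signs invert accordingly.
+ */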
+
+/*
+ * set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_scankey_with_tuple(BTScanInsert insertKey, Relation indexRel, IndexTuple itup, int numattrs)
+{
+	TupleDesc		itupdesc;
+	int				i;
+	ScanKey			scankeys = insertKey->scankeys;
+
+	insertKey->keysz = numattrs;
+	itupdesc = RelationGetDescr(indexRel);
+	for (i = 0; i < numattrs; i++)
+	{
+		Datum datum;
+		bool null;
+		int flags;
+
+		datum = index_getattr(itup, i + 1, itupdesc, &null);
+		flags = (null ? SK_ISNULL : 0) |
+				(indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+		scankeys[i].sk_flags = flags;
+		scankeys[i].sk_argument = datum;
+	}
+}
+
+/* copy the elements important to a skip from one insertion sk to another */
+static inline void
+_bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs)
+{
+	memcpy(to->scankeys, from->scankeys, sizeof(ScanKeyData) * (unsigned long)numattrs);
+	to->nextkey = from->nextkey;
+	to->keysz = numattrs;
+}
+
+/*
+ * Updates the existing scankey for skipping to the next prefix.
+ * The 'prefix' argument determines how many attributes the scankey will
+ * have: callers either pass the full skip->prefix, or a smaller value
+ * derived from the comparison result with the current tuple.
+ * For example, given SELECT * FROM tbl WHERE b<2, an index on (a,b,c) and
+ * skipping with prefix size 2: if we encounter the tuple (1,3,1), it does
+ * not match the qual b<2, but we also know that skipping to any later
+ * two-attribute prefix such as (1,4) is useless, because that will
+ * definitely not match either.  We do want to skip to e.g. (2,0), so in
+ * this case we skip over a prefix of size 1 instead.
+ *
+ * The provided itup may be NULL.  This happens when we don't want to use
+ * the current tuple to update the scankey, but instead want to use the
+ * existing curPos.skipScanKey to fill currentTupleKey; this accounts for
+ * some edge cases.
+ */
+static void
+_bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+										int prefix, IndexTuple itup, ScanDirection prefixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	/* callers pass skip->prefix here when a full-prefix skip is wanted, or
+	 * whatever smaller value the comparison result provided, so that we
+	 * never skip on more than skip->prefix attributes
+	 */
+	int numattrs = prefix;
+
+	if (itup != NULL)
+	{
+		_bt_update_scankey_with_tuple(&skip->currentTupleKey, indexRel, itup, numattrs);
+		_bt_copy_scankey(&skip->curPos.skipScanKey, &skip->currentTupleKey, numattrs);
+	}
+	else
+	{
+		skip->curPos.skipScanKey.keysz = numattrs;
+		_bt_copy_scankey(&skip->currentTupleKey, &skip->curPos.skipScanKey, numattrs);
+	}
+	/* update the strategy for the last attribute, as we will use it to
+	 * determine the rest of the flags (goback) when doing the actual tree search
+	 */
+	skip->currentTupleKey.scankeys[numattrs - 1].sk_strategy =
+			skip->curPos.skipScanKey.scankeys[numattrs - 1].sk_strategy =
+			ScanDirectionIsForward(prefixDir) ? BTGreaterStrategyNumber : BTLessStrategyNumber;
+}
+
+/* update the scankey for skipping the 'extra' conditions, opportunities
+ * that arise when we have just skipped to a new prefix and can try to skip
+ * within the prefix to the right tuple by using extra quals when available
+ *
+ * @todo as an optimization it should be possible to optimize calls to this function
+ * and to _bt_skip_update_scankey_for_prefix_skip to some more specific functions that
+ * will need to do less copying of data.
+ */
+void
+_bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel, ScanDirection curDir,
+									   ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	BTScanInsert toCopy;
+	int i, left, lastNonTuple = skip->prefix;
+
+	/* first make sure that currentTupleKey is correct at all times */
+	_bt_skip_update_scankey_for_prefix_skip(scan, indexRel, skip->prefix, itup, prefixDir);
+	/* then do the actual work to set up curPos.skipScanKey - distinguish between work that depends on prefixDir
+	 * (those attributes between attribute number 1 and 'prefix' inclusive)
+	 * and work that depends on curDir
+	 * (those attributes between attribute number 'prefix' + 1 and fwdScanKey.keysz inclusive)
+	 */
+	if (ScanDirectionIsForward(prefixDir))
+	{
+		/*
+		 * if prefixDir is forward, we need to choose between fwdScanKey and
+		 * currentTupleKey.  We need to choose the most restrictive one -
+		 * in most cases this means choosing e.g. a>5 over a=2 when scanning
+		 * forward, unless prioritizeEqual is set; this is done for certain
+		 * special cases.
+		 */
+		for (i = 0; i < skip->prefix; i++)
+		{
+			ScanKey scankey = &skip->fwdScanKey.scankeys[i];
+			ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+			if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) > 0
+										   || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+			{
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+				lastNonTuple = i;
+			}
+			else
+			{
+				if (lastNonTuple < i)
+					break;
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+			}
+			/* for now choose equal here - @todo this could be improved a bit
+			 * by choosing the strategy from the scankeys, but it doesn't matter much
+			 */
+			skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+		}
+	}
+	else
+	{
+		/* similar for backward but in opposite direction */
+		for (i = 0; i < skip->prefix; i++)
+		{
+			ScanKey scankey = &skip->bwdScanKey.scankeys[i];
+			ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+			if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) < 0
+										   || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+			{
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+				lastNonTuple = i;
+			}
+			else
+			{
+				if (lastNonTuple < i)
+					break;
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+			}
+			skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+		}
+	}
+
+	/*
+	 * the remaining keys are the quals after the prefix
+	 */
+	if (ScanDirectionIsForward(curDir))
+		toCopy = &skip->fwdScanKey;
+	else
+		toCopy = &skip->bwdScanKey;
+
+	if (lastNonTuple >= skip->prefix - 1)
+	{
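+		/*
+		 * Editor's note: in this branch the prefix loop above ran to
+		 * completion without an early break, so i == skip->prefix here.
+		 */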
+		left = toCopy->keysz - skip->prefix;
+		if (left > 0)
+		{
+			memcpy(skip->curPos.skipScanKey.scankeys + skip->prefix, toCopy->scankeys + i, sizeof(ScanKeyData) * (unsigned long)left);
+		}
+		skip->curPos.skipScanKey.keysz = toCopy->keysz;
+	}
+	else
+	{
+		skip->curPos.skipScanKey.keysz = lastNonTuple + 1;
+	}
+}
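+/*
+ * Editor's sketch of the merge performed above, under assumed quals: with
+ * an index on (a,b,c), prefix = 1 and quals a>5 AND c=3, suppose the scan
+ * just skipped to the tuple (7,2,9).  For the prefix attribute we compare
+ * the qual boundary (a>5) against the tuple value (a=7); the tuple value
+ * is the more restrictive forward bound, so skipScanKey keeps a=7.  Had
+ * the qual been a>8, the qual key would have won instead.  The attributes
+ * after the prefix (here c=3) are then copied verbatim from fwdScanKey or
+ * bwdScanKey, depending on the direction we are about to scan in.
+ */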
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e6b7211136..d4616a930e 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -560,7 +560,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
 	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 7c33711a9f..74b74b5b2c 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,10 +49,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 									 ScanKey leftarg, ScanKey rightarg,
 									 bool *result);
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
-static void _bt_mark_scankey_required(ScanKey skey);
+static void _bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag);
 static bool _bt_check_rowcompare(ScanKey skey,
 								 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
-								 ScanDirection dir, bool *continuescan);
+								 ScanDirection dir, bool *continuescan, int *prefixskipindex);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -87,9 +87,8 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		field themselves.
  */
 BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key)
 {
-	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
 	int			indnkeyatts;
@@ -109,8 +108,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 * Truncated attributes and non-key attributes are omitted from the final
 	 * scan key.
 	 */
-	key = palloc(offsetof(BTScanInsertData, scankeys) +
-				 sizeof(ScanKeyData) * indnkeyatts);
+	if (key == NULL)
+		key = palloc(offsetof(BTScanInsertData, scankeys) +
+					 sizeof(ScanKeyData) * indnkeyatts);
+
 	if (itup)
 		_bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
 	else
@@ -155,7 +156,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
 									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
+									   BTEqualStrategyNumber,
 									   InvalidOid,
 									   rel->rd_indcollation[i],
 									   procinfo,
@@ -745,7 +746,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	int			numberOfKeys = scan->numberOfKeys;
 	int16	   *indoption = scan->indexRelation->rd_indoption;
 	int			new_numberOfKeys;
-	int			numberOfEqualCols;
+	int			numberOfEqualCols, numberOfEqualColsSincePrefix;
 	ScanKey		inkeys;
 	ScanKey		outkeys;
 	ScanKey		cur;
@@ -754,6 +755,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	int			i,
 				j;
 	AttrNumber	attno;
+	int			prefix = 0;
 
 	/* initialize result variables */
 	so->qual_ok = true;
@@ -762,6 +764,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	if (numberOfKeys < 1)
 		return;					/* done if qual-less scan */
 
+	if (_bt_skip_enabled(so))
+	{
+		prefix = so->skipData->prefix;
+	}
+
 	/*
 	 * Read so->arrayKeyData if array keys are present, else scan->keyData
 	 */
@@ -786,7 +793,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 		so->numberOfKeys = 1;
 		/* We can mark the qual as required if it's for first index col */
 		if (cur->sk_attno == 1)
-			_bt_mark_scankey_required(outkeys);
+			_bt_mark_scankey_required(outkeys, SK_BT_REQFWD, SK_BT_REQBKWD);
+		if (cur->sk_attno <= prefix + 1)
+			_bt_mark_scankey_required(outkeys, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 		return;
 	}
 
@@ -795,6 +804,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	 */
 	new_numberOfKeys = 0;
 	numberOfEqualCols = 0;
+	numberOfEqualColsSincePrefix = 0;
+
 
 	/*
 	 * Initialize for processing of keys for attr 1.
@@ -830,6 +841,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 		if (i == numberOfKeys || cur->sk_attno != attno)
 		{
 			int			priorNumberOfEqualCols = numberOfEqualCols;
+			int			priorNumberOfEqualColsSincePrefix = numberOfEqualColsSincePrefix;
+
 
 			/* check input keys are correctly ordered */
 			if (i < numberOfKeys && cur->sk_attno < attno)
@@ -880,6 +893,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 				}
 				/* track number of attrs for which we have "=" keys */
 				numberOfEqualCols++;
+				if (attno > prefix)
+					numberOfEqualColsSincePrefix++;
 			}
 
 			/* try to keep only one of <, <= */
@@ -929,7 +944,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 
 					memcpy(outkey, xform[j], sizeof(ScanKeyData));
 					if (priorNumberOfEqualCols == attno - 1)
-						_bt_mark_scankey_required(outkey);
+						_bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+					if (attno <= prefix || priorNumberOfEqualColsSincePrefix == attno - prefix - 1)
+						_bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 				}
 			}
 
@@ -954,7 +971,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 
 			memcpy(outkey, cur, sizeof(ScanKeyData));
 			if (numberOfEqualCols == attno - 1)
-				_bt_mark_scankey_required(outkey);
+				_bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+			if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+				_bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 
 			/*
 			 * We don't support RowCompare using equality; such a qual would
@@ -997,7 +1016,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 
 				memcpy(outkey, cur, sizeof(ScanKeyData));
 				if (numberOfEqualCols == attno - 1)
-					_bt_mark_scankey_required(outkey);
+					_bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+				if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+					_bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 			}
 		}
 	}
@@ -1295,7 +1316,7 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
  * anyway on a rescan.  Something to keep an eye on though.
  */
 static void
-_bt_mark_scankey_required(ScanKey skey)
+_bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag)
 {
 	int			addflags;
 
@@ -1303,14 +1324,14 @@ _bt_mark_scankey_required(ScanKey skey)
 	{
 		case BTLessStrategyNumber:
 		case BTLessEqualStrategyNumber:
-			addflags = SK_BT_REQFWD;
+			addflags = forwardReqFlag;
 			break;
 		case BTEqualStrategyNumber:
-			addflags = SK_BT_REQFWD | SK_BT_REQBKWD;
+			addflags = forwardReqFlag | backwardReqFlag;
 			break;
 		case BTGreaterEqualStrategyNumber:
 		case BTGreaterStrategyNumber:
-			addflags = SK_BT_REQBKWD;
+			addflags = backwardReqFlag;
 			break;
 		default:
 			elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -1353,17 +1374,22 @@ _bt_mark_scankey_required(ScanKey skey)
  */
 bool
 _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
-			  ScanDirection dir, bool *continuescan)
+			  ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
 {
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
 	int			ikey;
 	ScanKey		key;
+	int pfx;
+
+	if (prefixSkipIndex == NULL)
+		prefixSkipIndex = &pfx;
 
 	Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
 
 	*continuescan = true;		/* default assumption */
+	*prefixSkipIndex = -1;
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1392,7 +1418,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
 			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
-									 continuescan))
+									 continuescan, prefixSkipIndex))
 				continue;
 			return false;
 		}
@@ -1429,6 +1455,13 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 					 ScanDirectionIsBackward(dir))
 				*continuescan = false;
 
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
@@ -1452,6 +1485,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsBackward(dir))
 					*continuescan = false;
+
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
 			}
 			else
 			{
@@ -1468,6 +1505,9 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsForward(dir))
 					*continuescan = false;
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsForward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
 			}
 
 			/*
@@ -1498,6 +1538,206 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 					 ScanDirectionIsBackward(dir))
 				*continuescan = false;
 
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+
+			/*
+			 * In any case, this indextuple doesn't match the qual.
+			 */
+			return false;
+		}
+	}
+
+	/* If we get here, the tuple passes all index quals. */
+	return true;
+}
+
+bool
+_bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+			  ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
+{
+	TupleDesc	tupdesc;
+	BTScanOpaque so;
+	int			keysz;
+	int			ikey;
+	ScanKey		key;
+	int pfx;
+	BTScanInsert keys;
+
+	if (prefixSkipIndex == NULL)
+		prefixSkipIndex = &pfx;
+
+	Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+	*continuescan = true;		/* default assumption */
+	*prefixSkipIndex = -1;
+
+	tupdesc = RelationGetDescr(scan->indexRelation);
+	so = (BTScanOpaque) scan->opaque;
+	if (ScanDirectionIsForward(dir))
+		keys = &so->skipData->bwdScanKey;
+	else
+		keys = &so->skipData->fwdScanKey;
+
+	keysz = keys->keysz;
+
+	for (key = keys->scankeys, ikey = 0; ikey < keysz; key++, ikey++)
+	{
+		Datum		datum;
+		bool		isNull;
+		int		cmpresult;
+
+		if (key->sk_attno == 0)
+			continue;
+
+		if (key->sk_attno > tupnatts)
+		{
+			/*
+			 * This attribute is truncated (must be high key).  The value for
+			 * this attribute in the first non-pivot tuple on the page to the
+			 * right could be any possible value.  Assume that truncated
+			 * attribute passes the qual.
+			 */
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
+		/* row-comparison keys need special processing */
+		Assert((key->sk_flags & SK_ROW_HEADER) == 0);
+
+		datum = index_getattr(tuple,
+							  key->sk_attno,
+							  tupdesc,
+							  &isNull);
+
+		if (key->sk_flags & SK_ISNULL)
+		{
+			/* Handle IS NULL/NOT NULL tests */
+			if (key->sk_flags & SK_SEARCHNULL)
+			{
+				if (isNull)
+					continue;	/* tuple satisfies this qual */
+			}
+			else
+			{
+				Assert(key->sk_flags & SK_SEARCHNOTNULL);
+				if (!isNull)
+					continue;	/* tuple satisfies this qual */
+			}
+
+			/*
+			 * Tuple fails this qual.  If it's a required qual for the current
+			 * scan direction, then we can conclude no further tuples will
+			 * pass, either.
+			 */
+			if ((key->sk_flags & SK_BT_REQFWD) &&
+				ScanDirectionIsForward(dir))
+				*continuescan = false;
+			else if ((key->sk_flags & SK_BT_REQBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*continuescan = false;
+
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+
+			/*
+			 * In any case, this indextuple doesn't match the qual.
+			 */
+			return false;
+		}
+
+		if (isNull)
+		{
+			if (key->sk_flags & SK_BT_NULLS_FIRST)
+			{
+				/*
+				 * Since NULLs are sorted before non-NULLs, we know we have
+				 * reached the lower limit of the range of values for this
+				 * index attr.  On a backward scan, we can stop if this qual
+				 * is one of the "must match" subset.  We can stop regardless
+				 * of whether the qual is > or <, so long as it's required,
+				 * because it's not possible for any future tuples to pass. On
+				 * a forward scan, however, we must keep going, because we may
+				 * have initially positioned to the start of the index.
+				 */
+				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*continuescan = false;
+
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
+			}
+			else
+			{
+				/*
+				 * Since NULLs are sorted after non-NULLs, we know we have
+				 * reached the upper limit of the range of values for this
+				 * index attr.  On a forward scan, we can stop if this qual is
+				 * one of the "must match" subset.  We can stop regardless of
+				 * whether the qual is > or <, so long as it's required,
+				 * because it's not possible for any future tuples to pass. On
+				 * a backward scan, however, we must keep going, because we
+				 * may have initially positioned to the end of the index.
+				 */
+				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+					ScanDirectionIsForward(dir))
+					*continuescan = false;
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsForward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
+			}
+
+			/*
+			 * In any case, this indextuple doesn't match the qual.
+			 */
+			return false;
+		}
+
+
+		/* Perform the test --- three-way comparison not bool operator */
+		cmpresult = DatumGetInt32(FunctionCall2Coll(&key->sk_func,
+													key->sk_collation,
+													datum,
+													key->sk_argument));
+
+		if (key->sk_flags & SK_BT_DESC)
+			INVERT_COMPARE_RESULT(cmpresult);
+
+		if (cmpresult != 0)
+		{
+			/*
+			 * Tuple fails this qual.  If it's a required qual for the current
+			 * scan direction, then we can conclude no further tuples will
+			 * pass, either.
+			 *
+			 * Note: because we stop the scan as soon as any required equality
+			 * qual fails, it is critical that equality quals be used for the
+			 * initial positioning in _bt_first() when they are available. See
+			 * comments in _bt_first().
+			 */
+			if ((key->sk_flags & SK_BT_REQFWD) &&
+				ScanDirectionIsForward(dir) && cmpresult > 0)
+				*continuescan = false;
+			else if ((key->sk_flags & SK_BT_REQBKWD) &&
+					 ScanDirectionIsBackward(dir) && cmpresult < 0)
+				*continuescan = false;
+
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir) && cmpresult > 0)
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir) && cmpresult < 0)
+				*prefixSkipIndex = key->sk_attno - 1;
+
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
@@ -1520,7 +1760,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
  */
 static bool
 _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
-					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1576,6 +1816,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 				if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsBackward(dir))
 					*continuescan = false;
+
+				if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = subkey->sk_attno - 1;
 			}
 			else
 			{
@@ -1592,6 +1836,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 				if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsForward(dir))
 					*continuescan = false;
+
+				if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsForward(dir))
+					*prefixSkipIndex = subkey->sk_attno - 1;
 			}
 
 			/*
@@ -1616,6 +1864,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
 					 ScanDirectionIsBackward(dir))
 				*continuescan = false;
+
+			if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = subkey->sk_attno - 1;
+			else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = subkey->sk_attno - 1;
 			return false;
 		}
 
@@ -1678,6 +1933,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 		else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
 				 ScanDirectionIsBackward(dir))
 			*continuescan = false;
+
+		if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+			ScanDirectionIsForward(dir))
+			*prefixSkipIndex = subkey->sk_attno - 1;
+		else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+				 ScanDirectionIsBackward(dir))
+			*prefixSkipIndex = subkey->sk_attno - 1;
 	}
 
 	return result;
@@ -2748,3 +3010,524 @@ _bt_allequalimage(Relation rel, bool debugmessage)
 
 	return allequalimage;
 }
+
+void _bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir, bool* nextkey, bool* goback)
+{
+	/*----------
+	 * Examine the selected initial-positioning strategy to determine exactly
+	 * where we need to start the scan, and set flag variables to control the
+	 * code below.
+	 *
+	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
+	 * item >= scan key.  If nextkey = true, they will locate the first
+	 * item > scan key.
+	 *
+	 * If goback = true, we will then step back one item, while if
+	 * goback = false, we will start the scan on the located item.
+	 *----------
+	 */
+	switch (stratTotal)
+	{
+		case BTLessStrategyNumber:
+
+			/*
+			 * Find first item >= scankey, then back up one to arrive at last
+			 * item < scankey.  (Note: this positioning strategy is only used
+			 * for a backward scan, so that is always the correct starting
+			 * position.)
+			 */
+			*nextkey = false;
+			*goback = true;
+			break;
+
+		case BTLessEqualStrategyNumber:
+
+			/*
+			 * Find first item > scankey, then back up one to arrive at last
+			 * item <= scankey.  (Note: this positioning strategy is only used
+			 * for a backward scan, so that is always the correct starting
+			 * position.)
+			 */
+			*nextkey = true;
+			*goback = true;
+			break;
+
+		case BTEqualStrategyNumber:
+
+			/*
+			 * If a backward scan was specified, need to start with last equal
+			 * item not first one.
+			 */
+			if (ScanDirectionIsBackward(dir))
+			{
+				/*
+				 * This is the same as the <= strategy.  We will check at the
+				 * end whether the found item is actually =.
+				 */
+				*nextkey = true;
+				*goback = true;
+			}
+			else
+			{
+				/*
+				 * This is the same as the >= strategy.  We will check at the
+				 * end whether the found item is actually =.
+				 */
+				*nextkey = false;
+				*goback = false;
+			}
+			break;
+
+		case BTGreaterEqualStrategyNumber:
+
+			/*
+			 * Find first item >= scankey.  (This is only used for forward
+			 * scans.)
+			 */
+			*nextkey = false;
+			*goback = false;
+			break;
+
+		case BTGreaterStrategyNumber:
+
+			/*
+			 * Find first item > scankey.  (This is only used for forward
+			 * scans.)
+			 */
+			*nextkey = true;
+			*goback = false;
+			break;
+
+		default:
+			/* can't get here, but keep compiler quiet */
+			elog(ERROR, "unrecognized strat_total: %d", (int) stratTotal);
+	}
+}
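+/*
+ * Editor's summary of the mapping implemented above:
+ *
+ *   strategy              nextkey  goback   positioned at
+ *   BTLess                false    true     last item <  sk
+ *   BTLessEqual           true     true     last item <= sk
+ *   BTEqual (backward)    true     true     last item <= sk
+ *   BTEqual (forward)     false    false    first item >= sk
+ *   BTGreaterEqual        false    false    first item >= sk
+ *   BTGreater             true     false    first item >  sk
+ */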
+
+bool _bt_create_insertion_scan_key(Relation	rel, ScanDirection dir, ScanKey* startKeys, int keysCount, BTScanInsert inskey, StrategyNumber* stratTotal,  bool* goback)
+{
+	int i;
+	bool nextkey;
+
+	/*
+	 * We want to start the scan somewhere within the index.  Set up an
+	 * insertion scankey we can use to search for the boundary point we
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
+	 */
+	Assert(keysCount <= INDEX_MAX_KEYS);
+	for (i = 0; i < keysCount; i++)
+	{
+		ScanKey		cur = startKeys[i];
+
+		if (cur == NULL)
+		{
+			inskey->scankeys[i].sk_attno = 0;
+			continue;
+		}
+
+		Assert(cur->sk_attno == i + 1);
+
+		if (cur->sk_flags & SK_ROW_HEADER)
+		{
+			/*
+			 * Row comparison header: look to the first row member instead.
+			 *
+			 * The member scankeys are already in insertion format (ie, they
+			 * have sk_func = 3-way-comparison function), but we have to watch
+			 * out for nulls, which _bt_preprocess_keys didn't check. A null
+			 * in the first row member makes the condition unmatchable, just
+			 * like qual_ok = false.
+			 */
+			ScanKey		subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+
+			Assert(subkey->sk_flags & SK_ROW_MEMBER);
+			if (subkey->sk_flags & SK_ISNULL)
+			{
+				return false;
+			}
+			memcpy(inskey->scankeys + i, subkey, sizeof(ScanKeyData));
+
+			/*
+			 * If the row comparison is the last positioning key we accepted,
+			 * try to add additional keys from the lower-order row members.
+			 * (If we accepted independent conditions on additional index
+			 * columns, we use those instead --- doesn't seem worth trying to
+			 * determine which is more restrictive.)  Note that this is OK
+			 * even if the row comparison is of ">" or "<" type, because the
+			 * condition applied to all but the last row member is effectively
+			 * ">=" or "<=", and so the extra keys don't break the positioning
+			 * scheme.  But, by the same token, if we aren't able to use all
+			 * the row members, then the part of the row comparison that we
+			 * did use has to be treated as just a ">=" or "<=" condition, and
+			 * so we'd better adjust strat_total accordingly.
+			 */
+			if (i == keysCount - 1)
+			{
+				bool		used_all_subkeys = false;
+
+				Assert(!(subkey->sk_flags & SK_ROW_END));
+				for (;;)
+				{
+					subkey++;
+					Assert(subkey->sk_flags & SK_ROW_MEMBER);
+					if (subkey->sk_attno != keysCount + 1)
+						break;	/* out-of-sequence, can't use it */
+					if (subkey->sk_strategy != cur->sk_strategy)
+						break;	/* wrong direction, can't use it */
+					if (subkey->sk_flags & SK_ISNULL)
+						break;	/* can't use null keys */
+					Assert(keysCount < INDEX_MAX_KEYS);
+					memcpy(inskey->scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
+					keysCount++;
+					if (subkey->sk_flags & SK_ROW_END)
+					{
+						used_all_subkeys = true;
+						break;
+					}
+				}
+				if (!used_all_subkeys)
+				{
+					switch (*stratTotal)
+					{
+						case BTLessStrategyNumber:
+							*stratTotal = BTLessEqualStrategyNumber;
+							break;
+						case BTGreaterStrategyNumber:
+							*stratTotal = BTGreaterEqualStrategyNumber;
+							break;
+					}
+				}
+				break;			/* done with outer loop */
+			}
+		}
+		else
+		{
+			/*
+			 * Ordinary comparison key.  Transform the search-style scan key
+			 * to an insertion scan key by replacing the sk_func with the
+			 * appropriate btree comparison function.
+			 *
+			 * If scankey operator is not a cross-type comparison, we can use
+			 * the cached comparison function; otherwise gotta look it up in
+			 * the catalogs.  (That can't lead to infinite recursion, since no
+			 * indexscan initiated by syscache lookup will use cross-data-type
+			 * operators.)
+			 *
+			 * We support the convention that sk_subtype == InvalidOid means
+			 * the opclass input type; this is a hack to simplify life for
+			 * ScanKeyInit().
+			 */
+			if (cur->sk_subtype == rel->rd_opcintype[i] ||
+				cur->sk_subtype == InvalidOid)
+			{
+				FmgrInfo   *procinfo;
+
+				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
+				ScanKeyEntryInitializeWithInfo(inskey->scankeys + i,
+											   cur->sk_flags,
+											   cur->sk_attno,
+											   cur->sk_strategy,
+											   cur->sk_subtype,
+											   cur->sk_collation,
+											   procinfo,
+											   cur->sk_argument);
+			}
+			else
+			{
+				RegProcedure cmp_proc;
+
+				cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
+											 rel->rd_opcintype[i],
+											 cur->sk_subtype,
+											 BTORDER_PROC);
+				if (!RegProcedureIsValid(cmp_proc))
+					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
+						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
+						 cur->sk_attno, RelationGetRelationName(rel));
+				ScanKeyEntryInitialize(inskey->scankeys + i,
+									   cur->sk_flags,
+									   cur->sk_attno,
+									   cur->sk_strategy,
+									   cur->sk_subtype,
+									   cur->sk_collation,
+									   cmp_proc,
+									   cur->sk_argument);
+			}
+		}
+	}
+
+	_bt_set_bsearch_flags(*stratTotal, dir, &nextkey, goback);
+
+	/* Initialize remaining insertion scan key fields */
+	_bt_metaversion(rel, &inskey->heapkeyspace, &inskey->allequalimage);
+	inskey->anynullkeys = false; /* unused */
+	inskey->nextkey = nextkey;
+	inskey->pivotsearch = false;
+	inskey->scantid = NULL;
+	inskey->keysz = keysCount;
+
+	return true;
+}
+
+/*----------
+ * Examine the scan keys to discover where we need to start the scan.
+ *
+ * We want to identify the keys that can be used as starting boundaries;
+ * these are =, >, or >= keys for a forward scan or =, <, <= keys for
+ * a backwards scan.  We can use keys for multiple attributes so long as
+ * the prior attributes had only =, >= (resp. =, <=) keys.  Once we accept
+ * a > or < boundary or find an attribute with no boundary (which can be
+ * thought of as the same as "> -infinity"), we can't use keys for any
+ * attributes to its right, because it would break our simplistic notion
+ * of what initial positioning strategy to use.
+ *
+ * When the scan keys include cross-type operators, _bt_preprocess_keys
+ * may not be able to eliminate redundant keys; in such cases we will
+ * arbitrarily pick a usable one for each attribute.  This is correct
+ * but possibly not optimal behavior.  (For example, with keys like
+ * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
+ * x=5 would be more efficient.)  Since the situation only arises given
+ * a poorly-worded query plus an incomplete opfamily, live with it.
+ *
+ * When both equality and inequality keys appear for a single attribute
+ * (again, only possible when cross-type operators appear), we *must*
+ * select one of the equality keys for the starting point, because
+ * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
+ * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
+ * start at x=4, we will fail and stop before reaching x=10.  If multiple
+ * equality quals survive preprocessing, however, it doesn't matter which
+ * one we use --- by definition, they are either redundant or
+ * contradictory.
+ *
+ * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
+ * If the index stores nulls at the end of the index we'll be starting
+ * from, and we have no boundary key for the column (which means the key
+ * we deduced NOT NULL from is an inequality key that constrains the other
+ * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
+ * use as a boundary key.  If we didn't do this, we might find ourselves
+ * traversing a lot of null entries at the start of the scan.
+ *
+ * In this loop, row-comparison keys are treated the same as keys on their
+ * first (leftmost) columns.  We'll add on lower-order columns of the row
+ * comparison below, if possible.
+ *
+ * The selected scan keys (at most one per index column) are remembered by
+ * storing their addresses into the local startKeys[] array.
+ *----------
+ */
+int _bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir, ScanKey* startKeys, ScanKeyData* notnullkeys, StrategyNumber* stratTotal, int prefix)
+{
+	StrategyNumber strat;
+	int			keysCount = 0;
+	int			i;
+
+	*stratTotal = BTEqualStrategyNumber;
+	if (numberOfKeys > 0 || prefix > 0)
+	{
+		AttrNumber	curattr;
+		ScanKey		chosen;
+		ScanKey		impliesNN;
+		ScanKey		cur;
+
+		/*
+		 * chosen is the so-far-chosen key for the current attribute, if any.
+		 * We don't cast the decision in stone until we reach keys for the
+		 * next attribute.
+		 */
+		curattr = 1;
+		chosen = NULL;
+		/* Also remember any scankey that implies a NOT NULL constraint */
+		impliesNN = NULL;
+
+		/*
+		 * Loop iterates from 0 to numberOfKeys inclusive; we use the last
+		 * pass to handle after-last-key processing.  Actual exit from the
+		 * loop is at one of the "break" statements below.
+		 */
+		for (cur = scanKeys, i = 0;; cur++, i++)
+		{
+			if (i >= numberOfKeys || cur->sk_attno != curattr)
+			{
+				/*
+				 * Done looking at keys for curattr.  If we didn't find a
+				 * usable boundary key, see if we can deduce a NOT NULL key.
+				 */
+				if (chosen == NULL && impliesNN != NULL &&
+					((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+					 ScanDirectionIsForward(dir) :
+					 ScanDirectionIsBackward(dir)))
+				{
+					/* Yes, so build the key in notnullkeys[keysCount] */
+					chosen = &notnullkeys[keysCount];
+					ScanKeyEntryInitialize(chosen,
+										   (SK_SEARCHNOTNULL | SK_ISNULL |
+											(impliesNN->sk_flags &
+											 (SK_BT_DESC | SK_BT_NULLS_FIRST))),
+										   curattr,
+										   ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+											BTGreaterStrategyNumber :
+											BTLessStrategyNumber),
+										   InvalidOid,
+										   InvalidOid,
+										   InvalidOid,
+										   (Datum) 0);
+				}
+
+				/*
+				 * If we still didn't find a usable boundary key, quit; else
+				 * save the boundary key pointer in startKeys.
+				 */
+				if (chosen == NULL && curattr > prefix)
+					break;
+				startKeys[keysCount++] = chosen;
+
+				/*
+				 * Adjust strat_total, and quit if we have stored a > or <
+				 * key.
+				 */
+				if (chosen != NULL && curattr > prefix)
+				{
+					strat = chosen->sk_strategy;
+					if (strat != BTEqualStrategyNumber)
+					{
+						*stratTotal = strat;
+						if (strat == BTGreaterStrategyNumber ||
+							strat == BTLessStrategyNumber)
+							break;
+					}
+				}
+
+				/*
+				 * Done if that was the last attribute, or if next key is not
+				 * in sequence (implying no boundary key is available for the
+				 * next attribute).
+				 */
+				if (i >= numberOfKeys)
+				{
+					curattr++;
+					while(curattr <= prefix)
+					{
+						startKeys[keysCount++] = NULL;
+						curattr++;
+					}
+					break;
+				}
+				else if (cur->sk_attno != curattr + 1)
+				{
+					curattr++;
+					while(curattr < cur->sk_attno && curattr <= prefix)
+					{
+						startKeys[keysCount++] = NULL;
+						curattr++;
+					}
+					if (curattr > prefix && curattr != cur->sk_attno)
+						break;
+				}
+				else
+				{
+					curattr++;
+				}
+
+				/*
+				 * Reset for next attr.
+				 */
+				chosen = NULL;
+				impliesNN = NULL;
+			}
+
+			/*
+			 * Can we use this key as a starting boundary for this attr?
+			 *
+			 * If not, does it imply a NOT NULL constraint?  (Because
+			 * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
+			 * *any* inequality key works for that; we need not test.)
+			 */
+			switch (cur->sk_strategy)
+			{
+				case BTLessStrategyNumber:
+				case BTLessEqualStrategyNumber:
+					if (chosen == NULL)
+					{
+						if (ScanDirectionIsBackward(dir))
+							chosen = cur;
+						else
+							impliesNN = cur;
+					}
+					break;
+				case BTEqualStrategyNumber:
+					/* override any non-equality choice */
+					chosen = cur;
+					break;
+				case BTGreaterEqualStrategyNumber:
+				case BTGreaterStrategyNumber:
+					if (chosen == NULL)
+					{
+						if (ScanDirectionIsForward(dir))
+							chosen = cur;
+						else
+							impliesNN = cur;
+					}
+					break;
+			}
+		}
+	}
+	return keysCount;
+}
+
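+/*
+ * print_itup -- debugging aid that emits a DEBUG1 line describing one
+ * pivot tuple, or the range between two pivot tuples, of an index page.
+ *
+ * Catalog indexes are not instrumented, to avoid infinite recursion:
+ * BuildIndexValueDescription() may itself access catalog indexes.
+ */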
+void
+print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel,
+		   char *extra)
+{
+	bool		isnull[INDEX_MAX_KEYS];
+	Datum		values[INDEX_MAX_KEYS];
+	char	   *lkey_desc = NULL;
+	char	   *rkey_desc;
+
+	/* Avoid infinite recursion -- don't instrument catalog indexes */
+	if (!IsCatalogRelation(rel))
+	{
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(left, rel);
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(left, itupdesc, values, isnull);
+		rel->rd_index->indnkeyatts = natts;
+
+		/*
+		 * Since the regression tests should pass when the instrumentation
+		 * patch is applied, be prepared for BuildIndexValueDescription() to
+		 * return NULL due to security considerations.
+		 */
+		lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+		if (lkey_desc && right)
+		{
+			/*
+			 * Revolting hack: modify tuple descriptor to have number of key
+			 * columns actually present in caller's pivot tuples
+			 */
+			natts = BTreeTupleGetNAtts(right, rel);
+			itupdesc->natts = Min(indnkeyatts, natts);
+			memset(&isnull, 0xFF, sizeof(isnull));
+			index_deform_tuple(right, itupdesc, values, isnull);
+			rel->rd_index->indnkeyatts = natts;
+			rkey_desc = BuildIndexValueDescription(rel, values, isnull);
+			elog(DEBUG1, "%s blk %u sk > %s, sk <= %s %s",
+				 RelationGetRelationName(rel), blk, lkey_desc, rkey_desc,
+				 extra);
+			pfree(rkey_desc);
+		}
+		else
+			elog(DEBUG1, "%s blk %u sk check %s %s",
+				 RelationGetRelationName(rel), blk, lkey_desc, extra);
+
+		/* Cleanup */
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
+		if (lkey_desc)
+			pfree(lkey_desc);
+	}
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 0efe05e552..54ffc05769 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -69,6 +69,9 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = spgcostestimate;
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a283e4d45c..389eb1528c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -148,6 +148,7 @@ static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
 static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 
 
@@ -1096,6 +1097,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
 }
 
+/*
+ * ExplainIndexSkipScanKeys -
+ *	  Append the skip scan prefix size to es->str.
+ *
+ * Only used for non-text formats; the text format reports skip scans via
+ * the "Skip scan" property instead.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+	if (skipPrefixSize > 0)
+	{
+		if (es->format != EXPLAIN_FORMAT_TEXT)
+			ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+	}
+}
+
 /*
  * ExplainNode -
  *	  Appends a description of a plan tree to es->str
@@ -1433,6 +1450,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			{
 				IndexScan  *indexscan = (IndexScan *) plan;
 
+				if (indexscan->indexdistinct)
+					ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+
 				ExplainIndexScanDetails(indexscan->indexid,
 										indexscan->indexorderdir,
 										es);
@@ -1443,6 +1463,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			{
 				IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
 
+				if (indexonlyscan->indexdistinct)
+					ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+
 				ExplainIndexScanDetails(indexonlyscan->indexid,
 										indexonlyscan->indexorderdir,
 										es);
@@ -1703,6 +1726,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_IndexScan:
+			if (((IndexScan *) plan)->indexskipprefixsize > 0)
+				ExplainPropertyText("Skip scan",
+									((IndexScan *) plan)->indexdistinct ?
+									"Distinct only" : "All", es);
 			show_scan_qual(((IndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
 			if (((IndexScan *) plan)->indexqualorig)
@@ -1716,6 +1741,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 										   planstate, es);
 			break;
 		case T_IndexOnlyScan:
+			if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+				ExplainPropertyText("Skip scan",
+									((IndexOnlyScan *) plan)->indexdistinct ?
+									"Distinct only" : "All", es);
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
 						   "Index Cond", planstate, ancestors, es);
 			if (((IndexOnlyScan *) plan)->indexqual)
@@ -1732,6 +1759,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 									 planstate->instrument->ntuples2, 0, es);
 			break;
 		case T_BitmapIndexScan:
+			if (((BitmapIndexScan *) plan)->indexskipprefixsize > 0)
+				ExplainPropertyText("Skip scan", "All", es);
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 642805d90c..0e77f241f9 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -133,6 +133,14 @@ ExecScanFetch(ScanState *node,
 	return (*accessMtd) (node);
 }
 
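+/*
+ * ExecScan -- the conventional scan interface: a thin wrapper around
+ * ExecScanExtended without a skip callback.
+ */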
+TupleTableSlot *
+ExecScan(ScanState *node,
+		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
+		 ExecScanRecheckMtd recheckMtd)
+{
+	return ExecScanExtended(node, accessMtd, recheckMtd, NULL);
+}
+
 /* ----------------------------------------------------------------
  *		ExecScan
  *
@@ -155,9 +163,10 @@ ExecScanFetch(ScanState *node,
  * ----------------------------------------------------------------
  */
 TupleTableSlot *
-ExecScan(ScanState *node,
+ExecScanExtended(ScanState *node,
 		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
-		 ExecScanRecheckMtd recheckMtd)
+		 ExecScanRecheckMtd recheckMtd,
+		 ExecScanSkipMtd skipMtd)
 {
 	ExprContext *econtext;
 	ExprState  *qual;
@@ -170,6 +179,20 @@ ExecScan(ScanState *node,
 	projInfo = node->ps.ps_ProjInfo;
 	econtext = node->ps.ps_ExprContext;
 
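+	/*
+	 * If a skip callback was supplied and the first tuple of the current
+	 * scan has already been emitted, ask the callback to skip ahead to the
+	 * next distinct prefix.  If it reports that nothing is left, the scan
+	 * is exhausted.
+	 */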
+	if (skipMtd != NULL && node->ss_FirstTupleEmitted)
+	{
+		bool cont = skipMtd(node);
+		if (!cont)
+		{
+			node->ss_FirstTupleEmitted = false;
+			return ExecClearTuple(node->ss_ScanTupleSlot);
+		}
+	}
+	else
+	{
+		node->ss_FirstTupleEmitted = true;
+	}
+
 	/* interrupt checks are in ExecScanFetch */
 
 	/*
@@ -178,8 +201,13 @@ ExecScan(ScanState *node,
 	 */
 	if (!qual && !projInfo)
 	{
+		TupleTableSlot *slot;
+
 		ResetExprContext(econtext);
-		return ExecScanFetch(node, accessMtd, recheckMtd);
+		slot = ExecScanFetch(node, accessMtd, recheckMtd);
+		if (TupIsNull(slot))
+			node->ss_FirstTupleEmitted = false;
+		return slot;
 	}
 
 	/*
@@ -206,6 +234,7 @@ ExecScan(ScanState *node,
 		 */
 		if (TupIsNull(slot))
 		{
+			node->ss_FirstTupleEmitted = false;
 			if (projInfo)
 				return ExecClearTuple(projInfo->pi_state.resultslot);
 			else
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 81a1208157..602c64fc91 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -22,13 +22,14 @@
 #include "postgres.h"
 
 #include "access/genam.h"
+#include "access/relscan.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapIndexscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
+#include "utils/rel.h"
 #include "utils/memutils.h"
 
-
 /* ----------------------------------------------------------------
  *		ExecBitmapIndexScan
  *
@@ -308,10 +309,20 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->biss_ScanDesc =
-		index_beginscan_bitmap(indexstate->biss_RelationDesc,
-							   estate->es_snapshot,
-							   indexstate->biss_NumScanKeys);
+	if (node->indexskipprefixsize > 0)
+	{
+		indexstate->biss_ScanDesc =
+			index_beginscan_bitmap_skip(indexstate->biss_RelationDesc,
+				estate->es_snapshot,
+				indexstate->biss_NumScanKeys,
+				Min(IndexRelationGetNumberOfKeyAttributes(indexstate->biss_RelationDesc),
+					node->indexskipprefixsize));
+	}
+	else
+		indexstate->biss_ScanDesc =
+			index_beginscan_bitmap(indexstate->biss_RelationDesc,
+								   estate->es_snapshot,
+								   indexstate->biss_NumScanKeys);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 5617ac29e7..aadba4a0fe 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -41,6 +41,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/itemptr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -49,6 +50,37 @@ static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
 							TupleDesc itupdesc);
 
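+/*
+ * IndexOnlySkip -- advance the scan to the next distinct prefix.
+ *
+ * Used as the skip callback for skip-based DISTINCT scans; for other scans
+ * it is a no-op that returns true.  Returns false once no further prefix
+ * exists, in which case the scan is done.
+ */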
+static bool
+IndexOnlySkip(IndexOnlyScanState *node)
+{
+	EState	   *estate;
+	ScanDirection direction;
+	IndexScanDesc scandesc;
+	IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+	if (!node->ioss_Distinct)
+		return true;
+
+	/*
+	 * extract necessary information from index scan node
+	 */
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	/* flip direction if this is an overall backward scan */
+	if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
+	{
+		if (ScanDirectionIsForward(direction))
+			direction = BackwardScanDirection;
+		else if (ScanDirectionIsBackward(direction))
+			direction = ForwardScanDirection;
+	}
+	scandesc = node->ioss_ScanDesc;
+
+	if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir))
+		return false;
+
+	return true;
+}
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -65,6 +97,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	ItemPointerData startTid;
+	IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
 
 	/*
 	 * extract necessary information from index scan node
@@ -72,7 +106,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	estate = node->ss.ps.state;
 	direction = estate->es_direction;
 	/* flip direction if this is an overall backward scan */
-	if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+	if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
 	{
 		if (ScanDirectionIsForward(direction))
 			direction = BackwardScanDirection;
@@ -90,11 +124,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * serially executing an index only scan that was planned to be
 		 * parallel.
 		 */
-		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->ioss_RelationDesc,
-								   estate->es_snapshot,
-								   node->ioss_NumScanKeys,
-								   node->ioss_NumOrderByKeys);
+		if (node->ioss_SkipPrefixSize > 0)
+			scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+									   node->ioss_RelationDesc,
+									   estate->es_snapshot,
+									   node->ioss_NumScanKeys,
+									   node->ioss_NumOrderByKeys,
+									   Min(IndexRelationGetNumberOfKeyAttributes(node->ioss_RelationDesc), node->ioss_SkipPrefixSize));
+		else
+			scandesc = index_beginscan(node->ss.ss_currentRelation,
+									   node->ioss_RelationDesc,
+									   estate->es_snapshot,
+									   node->ioss_NumScanKeys,
+									   node->ioss_NumOrderByKeys);
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -114,11 +156,16 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_OrderByKeys,
 						 node->ioss_NumOrderByKeys);
 	}
+	else
+	{
+		ItemPointerCopy(&scandesc->xs_heaptid, &startTid);
+	}
 
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((tid = node->ioss_SkipPrefixSize > 0 ?
+			index_getnext_tid_skip(scandesc, direction,
+								   node->ioss_Distinct ?
+								   indexonlyscan->indexorderdir : direction) :
+			index_getnext_tid(scandesc, direction)) != NULL)
 	{
 		bool		tuple_from_heap = false;
 
@@ -314,9 +361,10 @@ ExecIndexOnlyScan(PlanState *pstate)
 	if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
 		ExecReScan((PlanState *) node);
 
-	return ExecScan(&node->ss,
+	return ExecScanExtended(&node->ss,
 					(ExecScanAccessMtd) IndexOnlyNext,
-					(ExecScanRecheckMtd) IndexOnlyRecheck);
+					(ExecScanRecheckMtd) IndexOnlyRecheck,
+					node->ioss_Distinct ? (ExecScanSkipMtd) IndexOnlySkip : NULL);
 }
 
 /* ----------------------------------------------------------------
@@ -503,7 +551,10 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate = makeNode(IndexOnlyScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ss_FirstTupleEmitted = false;
 	indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+	indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+	indexstate->ioss_Distinct = node->indexdistinct;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index d0a96a38e0..db3b5a3379 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -69,6 +69,37 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
 							  Datum *orderbyvals, bool *orderbynulls);
 static HeapTuple reorderqueue_pop(IndexScanState *node);
 
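+/*
+ * IndexSkip -- advance the scan to the next distinct prefix.
+ *
+ * Used as the skip callback for skip-based DISTINCT scans; for other scans
+ * it is a no-op that returns true.  Returns false once no further prefix
+ * exists, in which case the scan is done.
+ */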
+static bool
+IndexSkip(IndexScanState *node)
+{
+	EState	   *estate;
+	ScanDirection direction;
+	IndexScanDesc scandesc;
+	IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+	if (!node->iss_Distinct)
+		return true;
+
+	/*
+	 * extract necessary information from index scan node
+	 */
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	/* flip direction if this is an overall backward scan */
+	if (ScanDirectionIsBackward(indexscan->indexorderdir))
+	{
+		if (ScanDirectionIsForward(direction))
+			direction = BackwardScanDirection;
+		else if (ScanDirectionIsBackward(direction))
+			direction = ForwardScanDirection;
+	}
+	scandesc = node->iss_ScanDesc;
+
+	if (!index_skip(scandesc, direction, indexscan->indexorderdir))
+		return false;
+
+	return true;
+}
 
 /* ----------------------------------------------------------------
  *		IndexNext
@@ -85,6 +116,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
 
 	/*
 	 * extract necessary information from index scan node
@@ -92,7 +124,7 @@ IndexNext(IndexScanState *node)
 	estate = node->ss.ps.state;
 	direction = estate->es_direction;
 	/* flip direction if this is an overall backward scan */
-	if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+	if (ScanDirectionIsBackward(indexscan->indexorderdir))
 	{
 		if (ScanDirectionIsForward(direction))
 			direction = BackwardScanDirection;
@@ -109,14 +141,25 @@ IndexNext(IndexScanState *node)
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
 		 */
-		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
-								   estate->es_snapshot,
-								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+		if (node->iss_SkipPrefixSize > 0)
+			scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+									   node->iss_RelationDesc,
+									   estate->es_snapshot,
+									   node->iss_NumScanKeys,
+									   node->iss_NumOrderByKeys,
+									   Min(IndexRelationGetNumberOfKeyAttributes(node->iss_RelationDesc), node->iss_SkipPrefixSize));
+		else
+			scandesc = index_beginscan(node->ss.ss_currentRelation,
+									   node->iss_RelationDesc,
+									   estate->es_snapshot,
+									   node->iss_NumScanKeys,
+									   node->iss_NumOrderByKeys);
 
 		node->iss_ScanDesc = scandesc;
 
+		/*
+		 * Index skip scan relies on xs_want_itup, so enable it whenever we
+		 * skip over distinct prefixes.
+		 */
+		node->iss_ScanDesc->xs_want_itup = indexscan->indexdistinct;
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -130,7 +173,9 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while (node->iss_SkipPrefixSize > 0 ?
+		   index_getnext_slot_skip(scandesc, direction,
+								   node->iss_Distinct ?
+								   indexscan->indexorderdir : direction,
+								   slot) :
+		   index_getnext_slot(scandesc, direction, slot))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -530,13 +575,15 @@ ExecIndexScan(PlanState *pstate)
 		ExecReScan((PlanState *) node);
 
 	if (node->iss_NumOrderByKeys > 0)
-		return ExecScan(&node->ss,
+		return ExecScanExtended(&node->ss,
 						(ExecScanAccessMtd) IndexNextWithReorder,
-						(ExecScanRecheckMtd) IndexRecheck);
+						(ExecScanRecheckMtd) IndexRecheck,
+						node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
 	else
-		return ExecScan(&node->ss,
+		return ExecScanExtended(&node->ss,
 						(ExecScanAccessMtd) IndexNext,
-						(ExecScanRecheckMtd) IndexRecheck);
+						(ExecScanRecheckMtd) IndexRecheck,
+						node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
 }
 
 /* ----------------------------------------------------------------
@@ -910,6 +957,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
 	indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+	indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+	indexstate->iss_Distinct = node->indexdistinct;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1f50400fd2..6d67efe54e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -493,6 +493,8 @@ _copyIndexScan(const IndexScan *from)
 	COPY_NODE_FIELD(indexorderbyorig);
 	COPY_NODE_FIELD(indexorderbyops);
 	COPY_SCALAR_FIELD(indexorderdir);
+	COPY_SCALAR_FIELD(indexskipprefixsize);
+	COPY_SCALAR_FIELD(indexdistinct);
 
 	return newnode;
 }
@@ -518,6 +520,8 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
 	COPY_NODE_FIELD(indexorderby);
 	COPY_NODE_FIELD(indextlist);
 	COPY_SCALAR_FIELD(indexorderdir);
+	COPY_SCALAR_FIELD(indexskipprefixsize);
+	COPY_SCALAR_FIELD(indexdistinct);
 
 	return newnode;
 }
@@ -542,6 +546,7 @@ _copyBitmapIndexScan(const BitmapIndexScan *from)
 	COPY_SCALAR_FIELD(isshared);
 	COPY_NODE_FIELD(indexqual);
 	COPY_NODE_FIELD(indexqualorig);
+	COPY_SCALAR_FIELD(indexskipprefixsize);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c3a9632992..5f395547f6 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -562,6 +562,8 @@ _outIndexScan(StringInfo str, const IndexScan *node)
 	WRITE_NODE_FIELD(indexorderbyorig);
 	WRITE_NODE_FIELD(indexorderbyops);
 	WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+	WRITE_INT_FIELD(indexskipprefixsize);
+	WRITE_BOOL_FIELD(indexdistinct);
 }
 
 static void
@@ -576,6 +578,9 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
 	WRITE_NODE_FIELD(indexorderby);
 	WRITE_NODE_FIELD(indextlist);
 	WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+	WRITE_INT_FIELD(indexskipprefixsize);
+	WRITE_BOOL_FIELD(indexdistinct);
 }
 
 static void
@@ -589,6 +594,7 @@ _outBitmapIndexScan(StringInfo str, const BitmapIndexScan *node)
 	WRITE_BOOL_FIELD(isshared);
 	WRITE_NODE_FIELD(indexqual);
 	WRITE_NODE_FIELD(indexqualorig);
+	WRITE_INT_FIELD(indexskipprefixsize);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3a18571d0c..d0bcf8a391 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1829,6 +1829,8 @@ _readIndexScan(void)
 	READ_NODE_FIELD(indexorderbyorig);
 	READ_NODE_FIELD(indexorderbyops);
 	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+	READ_INT_FIELD(indexskipprefixsize);
+	READ_BOOL_FIELD(indexdistinct);
 
 	READ_DONE();
 }
@@ -1848,6 +1850,8 @@ _readIndexOnlyScan(void)
 	READ_NODE_FIELD(indexorderby);
 	READ_NODE_FIELD(indextlist);
 	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+	READ_INT_FIELD(indexskipprefixsize);
+	READ_BOOL_FIELD(indexdistinct);
 
 	READ_DONE();
 }
@@ -1866,6 +1870,7 @@ _readBitmapIndexScan(void)
 	READ_BOOL_FIELD(isshared);
 	READ_NODE_FIELD(indexqual);
 	READ_NODE_FIELD(indexqualorig);
+	READ_INT_FIELD(indexskipprefixsize);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 945aa93374..fe6ef62e8f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -125,6 +125,7 @@ int			max_parallel_workers_per_gather = 2;
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
+bool		enable_indexskipscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index bcb1bc6097..216dd4611e 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -780,6 +780,16 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexPath  *ipath = (IndexPath *) lfirst(lc);
 
+		/*
+		 * To prevent unique paths from index skip scans from being chosen
+		 * when they are not needed, keep them in a separate pathlist.
+		 */
+		if (ipath->indexdistinct)
+		{
+			add_unique_path(rel, (Path *) ipath);
+			continue;
+		}
+
 		if (index->amhasgettuple)
 			add_path(rel, (Path *) ipath);
 
@@ -868,6 +878,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	bool		pathkeys_possibly_useful;
 	bool		index_is_ordered;
 	bool		index_only_scan;
+	bool		can_skip;
 	int			indexcol;
 
 	/*
@@ -1017,6 +1028,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	index_only_scan = (scantype != ST_BITMAPSCAN &&
 					   check_index_only(rel, index));
 
+	/* Check if an index skip scan is possible. */
+	can_skip = enable_indexskipscan && index->amcanskip;
+
 	/*
 	 * 4. Generate an indexscan path if there are relevant restriction clauses
 	 * in the current clauses, OR the index ordering is potentially useful for
@@ -1040,6 +1054,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  false);
 		result = lappend(result, ipath);
 
+		/* Consider index skip scan as well */
+		if (root->query_uniquekeys != NULL && can_skip)
+		{
+			int numusefulkeys = list_length(useful_pathkeys);
+			int numsortkeys = list_length(root->query_pathkeys);
+
+			if (numusefulkeys == numsortkeys)
+			{
+				int prefix;
+				if (list_length(root->distinct_pathkeys) > 0)
+					prefix = find_index_prefix_for_pathkey(root,
+														   index,
+														   ForwardScanDirection,
+														   llast_node(PathKey,
+														   root->distinct_pathkeys));
+				else
+					/*
+					 * All distinct keys are constant and were optimized
+					 * away; skipping with a prefix of 1 is sufficient, as
+					 * every key is constant anyway.
+					 */
+					prefix = 1;
+
+				result = lappend(result,
+								 create_skipscan_unique_path(root, index,
+															 (Path *) ipath, prefix));
+			}
+		}
+
 		/*
 		 * If appropriate, consider parallel index scan.  We don't allow
 		 * parallel index scan for bitmap index scans.
@@ -1095,6 +1136,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  false);
 			result = lappend(result, ipath);
 
+			/* Consider index skip scan as well */
+			if (root->query_uniquekeys != NULL && can_skip)
+			{
+				int numusefulkeys = list_length(useful_pathkeys);
+				int numsortkeys = list_length(root->query_pathkeys);
+
+				if (numusefulkeys == numsortkeys)
+				{
+					int prefix;
+					if (list_length(root->distinct_pathkeys) > 0)
+						prefix = find_index_prefix_for_pathkey(root,
+															   index,
+															   BackwardScanDirection,
+															   llast_node(PathKey,
+															   root->distinct_pathkeys));
+					else
+						/*
+						 * All distinct keys are constant and were optimized
+						 * away; skipping with a prefix of 1 is sufficient,
+						 * as every key is constant anyway.
+						 */
+						prefix = 1;
+
+					result = lappend(result,
+									 create_skipscan_unique_path(root, index,
+																 (Path *) ipath, prefix));
+				}
+			}
+
 			/* If appropriate, consider parallel index scan */
 			if (index->amcanparallel &&
 				rel->consider_parallel && outer_relids == NULL &&
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index a4fc4f252d..3fa533be95 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -522,6 +522,78 @@ get_cheapest_parallel_safe_total_inner(List *paths)
  *		NEW PATHKEY FORMATION
  ****************************************************************************/
 
+/*
+ * Find the prefix size for a specific path key in an index.
+ * For example, for an index on (a,b,c), looking up the path key for b
+ * returns prefix 2.
+ * Returns 0 when the path key is not found in the index.
+ */
+int
+find_index_prefix_for_pathkey(PlannerInfo *root,
+					 IndexOptInfo *index,
+					 ScanDirection scandir,
+					 PathKey *pathkey)
+{
+	ListCell   *lc;
+	int			i;
+
+	i = 0;
+	foreach(lc, index->indextlist)
+	{
+		TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+		Expr	   *indexkey;
+		bool		reverse_sort;
+		bool		nulls_first;
+		PathKey    *cpathkey;
+
+		/*
+		 * INCLUDE columns are stored in index unordered, so they don't
+		 * support ordered index scan.
+		 */
+		if (i >= index->nkeycolumns)
+			break;
+
+		/* We assume we don't need to make a copy of the tlist item */
+		indexkey = indextle->expr;
+
+		if (ScanDirectionIsBackward(scandir))
+		{
+			reverse_sort = !index->reverse_sort[i];
+			nulls_first = !index->nulls_first[i];
+		}
+		else
+		{
+			reverse_sort = index->reverse_sort[i];
+			nulls_first = index->nulls_first[i];
+		}
+
+		/*
+		 * OK, try to make a canonical pathkey for this sort key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  indexkey,
+											  NULL,
+											  index->sortopfamily[i],
+											  index->opcintype[i],
+											  index->indexcollations[i],
+											  reverse_sort,
+											  nulls_first,
+											  0,
+											  index->rel->relids,
+											  false);
+
+		if (cpathkey == pathkey)
+		{
+			return i + 1;
+		}
+
+		i++;
+	}
+
+	return 0;
+}
+
 /*
  * build_index_pathkeys
  *	  Build a pathkeys list that describes the ordering induced by an index
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 9941dfe65e..b48299e7f6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -177,15 +177,20 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 								 Oid indexid, List *indexqual, List *indexqualorig,
 								 List *indexorderby, List *indexorderbyorig,
 								 List *indexorderbyops,
-								 ScanDirection indexscandir);
+								 ScanDirection indexscandir,
+								 int skipprefix,
+								 bool distinct);
 static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
 										 Index scanrelid, Oid indexid,
 										 List *indexqual, List *indexorderby,
 										 List *indextlist,
-										 ScanDirection indexscandir);
+										 ScanDirection indexscandir,
+										 int skipprefix,
+										 bool distinct);
 static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
 											  List *indexqual,
-											  List *indexqualorig);
+											  List *indexqualorig,
+											  int skipPrefixSize);
 static BitmapHeapScan *make_bitmap_heapscan(List *qptlist,
 											List *qpqual,
 											Plan *lefttree,
@@ -2985,7 +2990,9 @@ create_indexscan_plan(PlannerInfo *root,
 												fixed_indexquals,
 												fixed_indexorderbys,
 												best_path->indexinfo->indextlist,
-												best_path->indexscandir);
+												best_path->indexscandir,
+												best_path->indexskipprefix,
+												best_path->indexdistinct);
 	else
 		scan_plan = (Scan *) make_indexscan(tlist,
 											qpqual,
@@ -2996,7 +3003,9 @@ create_indexscan_plan(PlannerInfo *root,
 											fixed_indexorderbys,
 											indexorderbys,
 											indexorderbyops,
-											best_path->indexscandir);
+											best_path->indexscandir,
+											best_path->indexskipprefix,
+											best_path->indexdistinct);
 
 	copy_generic_path_info(&scan_plan->plan, &best_path->path);
 
@@ -3286,7 +3295,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_indexscan(iscan->scan.scanrelid,
 											  iscan->indexid,
 											  iscan->indexqual,
-											  iscan->indexqualorig);
+											  iscan->indexqualorig,
+											  iscan->indexskipprefixsize);
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
@@ -5263,7 +5273,9 @@ make_indexscan(List *qptlist,
 			   List *indexorderby,
 			   List *indexorderbyorig,
 			   List *indexorderbyops,
-			   ScanDirection indexscandir)
+			   ScanDirection indexscandir,
+			   int skipPrefixSize,
+			   bool distinct)
 {
 	IndexScan  *node = makeNode(IndexScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5280,6 +5292,8 @@ make_indexscan(List *qptlist,
 	node->indexorderbyorig = indexorderbyorig;
 	node->indexorderbyops = indexorderbyops;
 	node->indexorderdir = indexscandir;
+	node->indexskipprefixsize = skipPrefixSize;
+	node->indexdistinct = distinct;
 
 	return node;
 }
@@ -5292,7 +5306,9 @@ make_indexonlyscan(List *qptlist,
 				   List *indexqual,
 				   List *indexorderby,
 				   List *indextlist,
-				   ScanDirection indexscandir)
+				   ScanDirection indexscandir,
+				   int skipPrefixSize,
+				   bool distinct)
 {
 	IndexOnlyScan *node = makeNode(IndexOnlyScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5307,6 +5323,8 @@ make_indexonlyscan(List *qptlist,
 	node->indexorderby = indexorderby;
 	node->indextlist = indextlist;
 	node->indexorderdir = indexscandir;
+	node->indexskipprefixsize = skipPrefixSize;
+	node->indexdistinct = distinct;
 
 	return node;
 }
@@ -5315,7 +5333,8 @@ static BitmapIndexScan *
 make_bitmap_indexscan(Index scanrelid,
 					  Oid indexid,
 					  List *indexqual,
-					  List *indexqualorig)
+					  List *indexqualorig,
+					  int skipPrefixSize)
 {
 	BitmapIndexScan *node = makeNode(BitmapIndexScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5328,6 +5347,7 @@ make_bitmap_indexscan(Index scanrelid,
 	node->indexid = indexid;
 	node->indexqual = indexqual;
 	node->indexqualorig = indexqualorig;
+	node->indexskipprefixsize = skipPrefixSize;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 58386d5040..6a5259926d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3633,12 +3633,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 
 	if (parse->distinctClause &&
 		grouping_is_sortable(parse->distinctClause))
+	{
 		root->distinct_pathkeys =
 			make_pathkeys_for_sortclauses(root,
 										  parse->distinctClause,
 										  tlist);
+		root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+	}
 	else
+	{
 		root->distinct_pathkeys = NIL;
+		root->query_uniquekeys = NIL;
+	}
 
 	root->sort_pathkeys =
 		make_pathkeys_for_sortclauses(root,
@@ -4832,7 +4838,7 @@ create_distinct_paths(PlannerInfo *root,
 		{
 			Path	   *path = (Path *) lfirst(lc);
 
-			if (query_has_uniquekeys_for(root, needed_pathkeys, false))
+			if (query_has_uniquekeys_for(root, path->uniquekeys, false))
 				add_path(distinct_rel, path);
 		}
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d4abb3cb47..79af18e7a7 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3003,6 +3003,44 @@ create_upper_unique_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_skipscan_unique_path
+ *	  Creates a pathnode the same as an existing IndexPath except based on
+ *	  skipping duplicate values.  This may or may not be cheaper than using
+ *	  create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root, IndexOptInfo *index,
+							Path *basepath, int prefix)
+{
+	IndexPath  *pathnode = makeNode(IndexPath);
+	double		numDistinctRows;
+	UniqueKey  *ukey;
+
+	Assert(IsA(basepath, IndexPath));
+
+	/* We don't want to modify basepath, so make a copy. */
+	memcpy(pathnode, basepath, sizeof(IndexPath));
+
+	ukey = linitial_node(UniqueKey, root->query_uniquekeys);
+
+	Assert(prefix > 0);
+	pathnode->indexskipprefix = prefix;
+	pathnode->indexdistinct = true;
+	pathnode->path.uniquekeys = root->query_uniquekeys;
+
+	numDistinctRows = estimate_num_groups(root, ukey->exprs,
+										  pathnode->path.rows,
+										  NULL);
+
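+	/*
+	 * Each distinct prefix requires roughly one more descent of the index,
+	 * which is what the startup cost of the underlying index path
+	 * represents, so estimate the total cost as the startup cost times the
+	 * number of distinct groups.
+	 */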
+	pathnode->path.total_cost = pathnode->path.startup_cost * numDistinctRows;
+	pathnode->path.rows = numDistinctRows;
+
+	return pathnode;
+}
+
 /*
  * create_agg_path
  *	  Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 0b2f9d398a..6fb3648d0f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,9 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanskip = (amroutine->amskip != NULL &&
+					amroutine->amgetskiptuple != NULL &&
+					amroutine->ambeginskipscan != NULL);
 			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 031ca0327f..56b2cecdfd 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -952,6 +952,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of index-skip-scan plans."),
+			NULL
+		},
+		&enable_indexskipscan,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e430e33c7b..95def9aa34 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -355,6 +355,7 @@
 #enable_hashjoin = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexskipscan = on
 #enable_material = on
 #enable_mergejoin = on
 #enable_nestloop = on
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3c49476483..52432cb24d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -991,7 +991,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey(indexRel, NULL);
+	indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -1086,7 +1086,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey(indexRel, NULL);
+	indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4325faa460..7de854649a 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -119,6 +119,12 @@ typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 											   int nkeys,
 											   int norderbys);
 
+/* prepare for index scan with skip */
+typedef IndexScanDesc (*ambeginscan_skip_function) (Relation indexRelation,
+											   int nkeys,
+											   int norderbys,
+											   int prefix);
+
 /* (re)start index scan */
 typedef void (*amrescan_function) (IndexScanDesc scan,
 								   ScanKey keys,
@@ -130,6 +136,16 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next valid tuple */
+typedef bool (*amgettuple_with_skip_function) (IndexScanDesc scan,
+											   ScanDirection prefixDir,
+											   ScanDirection postfixDir);
+
+/* skip past duplicates */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+								 ScanDirection prefixDir,
+								 ScanDirection postfixDir);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -225,12 +241,15 @@ typedef struct IndexAmRoutine
 	ambuildphasename_function ambuildphasename; /* can be NULL */
 	amvalidate_function amvalidate;
 	ambeginscan_function ambeginscan;
+	ambeginscan_skip_function ambeginskipscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgettuple_with_skip_function amgetskiptuple; /* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
 	ammarkpos_function ammarkpos;	/* can be NULL */
 	amrestrpos_function amrestrpos; /* can be NULL */
+	amskip_function amskip;				/* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 931257bd81..2949d457b6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -149,9 +149,17 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
 									 Snapshot snapshot,
 									 int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_skip(Relation heapRelation,
+									 Relation indexRelation,
+									 Snapshot snapshot,
+									 int nkeys, int norderbys, int prefix);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 											Snapshot snapshot,
 											int nkeys);
+extern IndexScanDesc index_beginscan_bitmap_skip(Relation indexRelation,
+											Snapshot snapshot,
+											int nkeys,
+											int prefix);
 extern void index_rescan(IndexScanDesc scan,
 						 ScanKey keys, int nkeys,
 						 ScanKey orderbys, int norderbys);
@@ -167,10 +175,16 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
+extern ItemPointer index_getnext_tid_skip(IndexScanDesc scan,
+									 ScanDirection prefixDir,
+									 ScanDirection postfixDir);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
+extern bool index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir,
+									ScanDirection postfixDir,
+									struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -180,6 +194,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
 												   IndexBulkDeleteResult *stats);
 extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection prefixDir,
+					   ScanDirection postfixDir);
 extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
 extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 79506c748b..1e86829063 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -901,6 +901,53 @@ typedef struct BTArrayKeyInfo
 	Datum	   *elem_values;	/* array of num_elems Datums */
 } BTArrayKeyInfo;
 
+typedef struct BTSkipCompareResult
+{
+	bool		equal;
+	int			prefixCmpResult, skCmpResult;
+	bool		prefixSkip, fullKeySkip;
+	int			prefixSkipIndex;
+} BTSkipCompareResult;
+
+typedef enum BTSkipState
+{
+	SkipStateStop,
+	SkipStateSkip,
+	SkipStateSkipExtra,
+	SkipStateNext
+} BTSkipState;
+
+typedef struct BTSkipPosData
+{
+	BTSkipState nextAction;
+	ScanDirection nextDirection;
+	int nextSkipIndex;
+	BTScanInsertData skipScanKey;
+} BTSkipPosData;
+
+typedef struct BTSkipData
+{
+	/*
+	 * Used to control skipping.  skipScanKey is a combination of
+	 * currentTupleKey and fwdScanKey/bwdScanKey: currentTupleKey contains
+	 * the scan keys for the current tuple, fwdScanKey the scan keys for
+	 * quals that would be chosen for a forward scan, and bwdScanKey the
+	 * scan keys for quals that would be chosen for a backward scan.  We
+	 * need both fwd and bwd because the chosen keys differ per direction:
+	 * for the quals a > 2 AND a < 5, fwd would contain a > 2, while bwd
+	 * would contain a < 5.
+	 */
+	BTScanInsertData	currentTupleKey;
+	BTScanInsertData	fwdScanKey;
+	ScanKeyData			fwdNotNullKeys[INDEX_MAX_KEYS];
+	BTScanInsertData	bwdScanKey;
+	ScanKeyData			bwdNotNullKeys[INDEX_MAX_KEYS];
+	/* length of prefix to skip */
+	int					prefix;
+
+	BTSkipPosData curPos, markPos;
+} BTSkipData;
+
+typedef BTSkipData *BTSkip;
+
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -938,6 +985,9 @@ typedef struct BTScanOpaqueData
 	 */
 	int			markItemIndex;	/* itemIndex, or -1 if not valid */
 
+	/* Work space for _bt_skip */
+	BTSkip	skipData;	/* used to control skipping */
+
 	/* keep these last in struct for efficiency */
 	BTScanPosData currPos;		/* current position data */
 	BTScanPosData markPos;		/* marked position, if any */
@@ -952,6 +1002,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
  */
 #define SK_BT_REQFWD	0x00010000	/* required to continue forward scan */
 #define SK_BT_REQBKWD	0x00020000	/* required to continue backward scan */
+#define SK_BT_REQSKIPFWD	0x00040000	/* required to continue forward scan within current prefix */
+#define SK_BT_REQSKIPBKWD	0x00080000	/* required to continue backward scan within current prefix */
 #define SK_BT_INDOPTION_SHIFT  24	/* must clear the above bits */
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -998,9 +1050,12 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 					 IndexUniqueCheck checkUnique,
 					 struct IndexInfo *indexInfo);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix);
 extern Size btestimateparallelscan(void);
 extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
@@ -1093,15 +1148,79 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 							   Snapshot snapshot);
+extern Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+						 OffsetNumber *offnum, bool isRegularMode);
+extern bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+extern void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+
+/*
+ * prototypes for functions in nbtskip.c
+ */
+static inline bool
+_bt_skip_enabled(BTScanOpaque so)
+{
+	return so->skipData != NULL;
+}
+
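+/*
+ * A skip scan runs in "regular mode" when the prefix and postfix scan
+ * directions match.  They can differ, e.g. when the executor fetches
+ * backward from a skip-based DISTINCT scan: the prefix direction follows
+ * the fetch direction, while the postfix direction stays the planned
+ * index order.
+ */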
+static inline bool
+_bt_skip_is_regular_mode(ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	return prefixDir == postfixDir;
+}
+
+/* returns whether or not we can use extra quals in the scankey after skipping to a prefix */
+static inline bool
+_bt_has_extra_quals_after_skip(BTSkip skip, ScanDirection dir, int prefix)
+{
+	if (ScanDirectionIsForward(dir))
+	{
+		return skip->fwdScanKey.keysz > prefix;
+	}
+	else
+	{
+		return skip->bwdScanKey.keysz > prefix;
+	}
+}
+
+/* alias of BTScanPosIsValid */
+static inline bool
+_bt_skip_is_always_valid(BTScanOpaque so)
+{
+	return BTScanPosIsValid(so->currPos);
+}
+
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_create_scankeys(Relation rel, BTScanOpaque so);
+extern void _bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan,
+												   Relation indexRel,
+												   ScanDirection curDir,
+												   ScanDirection prefixDir,
+												   bool prioritizeEqual,
+												   IndexTuple itup);
+extern void _bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+						  bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+									  ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp);
+extern bool _bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+							   ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+								 ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_has_results(BTScanOpaque so);
+extern void _bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+									 ScanDirection dir, bool isRegularMode, BTSkipCompareResult* cmp);
+extern bool _bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum);
+extern bool _bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+								  OffsetNumber *curTupleOffnum);
+extern bool _bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+							   ScanDirection dir, bool *continuescan, int *prefixskipindex);
 
 /*
  * prototypes for functions in nbtutils.c
  */
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1110,7 +1229,7 @@ extern void _bt_mark_array_keys(IndexScanDesc scan);
 extern void _bt_restore_array_keys(IndexScanDesc scan);
 extern void _bt_preprocess_keys(IndexScanDesc scan);
 extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
-						  int tupnatts, ScanDirection dir, bool *continuescan);
+						  int tupnatts, ScanDirection dir, bool *continuescan, int *indexSkipPrefix);
 extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1132,6 +1251,19 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
 extern bool _bt_allequalimage(Relation rel, bool debugmessage);
+extern bool _bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple,
+								   int tupnatts, ScanDirection dir,
+								   bool *continuescan, int *prefixSkipIndex);
+extern bool _bt_create_insertion_scan_key(Relation rel, ScanDirection dir,
+										  ScanKey *startKeys, int keysCount,
+										  BTScanInsert inskey,
+										  StrategyNumber *stratTotal,
+										  bool *goback);
+extern void _bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir,
+								  bool *nextkey, bool *goback);
+extern int _bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys,
+								ScanDirection dir, ScanKey *startKeys,
+								ScanKeyData *notnullkeys,
+								StrategyNumber *stratTotal, int prefix);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key);
+extern void print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel, char *extra);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..fe7729bf75 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -429,9 +429,13 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
  */
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanSkipMtd) (ScanState *node);
 
 extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 								ExecScanRecheckMtd recheckMtd);
+extern TupleTableSlot *ExecScanExtended(ScanState *node, ExecScanAccessMtd accessMtd,
+								ExecScanRecheckMtd recheckMtd,
+								ExecScanSkipMtd skipMtd);
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, Index varno);
 extern void ExecScanReScan(ScanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6f96b31fb4..03fd532794 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1333,6 +1333,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	bool		ss_FirstTupleEmitted;	/* has the first tuple been emitted? */
 } ScanState;
 
 /* ----------------
@@ -1429,6 +1430,8 @@ typedef struct IndexScanState
 	ExprContext *iss_RuntimeContext;
 	Relation	iss_RelationDesc;
 	struct IndexScanDescData *iss_ScanDesc;
+	int			iss_SkipPrefixSize; /* size of the skip scan prefix, or 0 */
+	bool		iss_Distinct;	/* return only distinct prefixes? */
 
 	/* These are needed for re-checking ORDER BY expr ordering */
 	pairingheap *iss_ReorderQueue;
@@ -1458,6 +1461,8 @@ typedef struct IndexScanState
  *		TableSlot		   slot for holding tuples fetched from the table
  *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
+ *		SkipPrefixSize	   number of keys for skip-based DISTINCT
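+ *		Distinct		   whether only distinct prefixes are returned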
+ *		FirstTupleEmitted  has the first tuple been emitted
  * ----------------
  */
 typedef struct IndexOnlyScanState
@@ -1476,6 +1481,8 @@ typedef struct IndexOnlyScanState
 	struct IndexScanDescData *ioss_ScanDesc;
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
+	int			ioss_SkipPrefixSize;
+	bool		ioss_Distinct;
 	Size		ioss_PscanLen;
 } IndexOnlyScanState;
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a5c406bd4e..e5101328b7 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1229,6 +1229,9 @@ typedef struct Path
  * we need not recompute them when considering using the same index in a
  * bitmap index/heap scan (see BitmapHeapPath).  The costs of the IndexPath
  * itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
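+ * 'indexdistinct' indicates that only the first tuple of each distinct
+ * prefix should be returned.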
  *----------
  */
 typedef struct IndexPath
@@ -1241,6 +1244,8 @@ typedef struct IndexPath
 	ScanDirection indexscandir;
 	Cost		indextotalcost;
 	Selectivity indexselectivity;
+	int			indexskipprefix;
+	bool		indexdistinct;
 } IndexPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..f209b7963b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -409,6 +409,8 @@ typedef struct IndexScan
 	List	   *indexorderbyorig;	/* the same in original form */
 	List	   *indexorderbyops;	/* OIDs of sort ops for ORDER BY exprs */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	int			indexskipprefixsize;	/* the size of the prefix for skip scans */
+	bool		indexdistinct; /* whether only distinct keys are requested */
 } IndexScan;
 
 /* ----------------
@@ -436,6 +438,8 @@ typedef struct IndexOnlyScan
 	List	   *indexorderby;	/* list of index ORDER BY exprs */
 	List	   *indextlist;		/* TargetEntry list describing index's cols */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	int			indexskipprefixsize;	/* the size of the prefix for skip scans */
+	bool		indexdistinct; /* whether only distinct keys are requested */
 } IndexOnlyScan;
 
 /* ----------------
@@ -462,6 +466,7 @@ typedef struct BitmapIndexScan
 	bool		isshared;		/* Create shared bitmap if set */
 	List	   *indexqual;		/* list of index quals (OpExprs) */
 	List	   *indexqualorig;	/* the same in original form */
+	int			indexskipprefixsize;	/* the size of the prefix for skip scans */
 } BitmapIndexScan;
 
 /* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 613db8eab6..c0f176eaaa 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
 extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 6796ad8cb7..8ec1780a56 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -207,6 +207,10 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
 												 Path *subpath,
 												 int numCols,
 												 double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+											  IndexOptInfo *index,
+											  Path *subpath,
+											  int prefix);
 extern AggPath *create_agg_path(PlannerInfo *root,
 								RelOptInfo *rel,
 								Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0cb8030e33..f934f0011a 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -198,6 +198,10 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   Relids required_outer,
 													   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
+extern int find_index_prefix_for_pathkey(PlannerInfo *root,
+					 IndexOptInfo *index,
+					 ScanDirection scandir,
+					 PathKey *pathkey);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
 extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
diff --git a/src/interfaces/libpq/encnames.c b/src/interfaces/libpq/encnames.c
new file mode 120000
index 0000000000..ca78618b55
--- /dev/null
+++ b/src/interfaces/libpq/encnames.c
@@ -0,0 +1 @@
+../../../src/backend/utils/mb/encnames.c
\ No newline at end of file
diff --git a/src/interfaces/libpq/wchar.c b/src/interfaces/libpq/wchar.c
new file mode 120000
index 0000000000..a27508f72a
--- /dev/null
+++ b/src/interfaces/libpq/wchar.c
@@ -0,0 +1 @@
+../../../src/backend/utils/mb/wchar.c
\ No newline at end of file
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 11c6f50fbf..bdcf75fe8e 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -306,3 +306,602 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
  t
 (1 row)
 
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+    SELECT five, tenthous, 10 FROM
+    generate_series(1, 5) five,
+    generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a 
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a 
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+(2 rows)
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b 
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b 
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (b = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (b = 2)
+(3 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b 
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b 
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b 
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b 
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a |   b   
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b 
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a |   b   
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a |   b   
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+	VALUES (1, 1, 1),
+		   (1, 1, 2),
+		   (1, 2, 2),
+		   (1, 2, 3),
+		   (2, 2, 1),
+		   (2, 2, 3),
+		   (3, 1, 1),
+		   (3, 1, 2),
+		   (3, 2, 2),
+		   (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+                          QUERY PLAN                          
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+   Skip scan: Distinct only
+   Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c 
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c 
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+                              QUERY PLAN                               
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+   Skip scan: Distinct only
+   Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c 
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c  
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c  
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (a = 1)
+(3 rows)
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (b = 2)
+   Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column? 
+---+----------
+ 1 |        1
+ 2 |        1
+ 3 |        1
+ 4 |        1
+ 5 |        1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a 
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a 
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a 
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a 
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 |  9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+    SELECT a, b::int2 b, (b % 2)::int2 c FROM
+        generate_series(1, 5) a,
+        generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+                                 QUERY PLAN                                 
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+   Skip scan: Distinct only
+   Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c 
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
+-- test tuple killing
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(4 rows)
+
+    FETCH BACKWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- regular ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a, b;
+    FETCH FORWARD ALL FROM c;
+ a | b | c | d  
+---+---+---+----
+ 1 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 5 | 1 | 1 | 10
+(4 rows)
+
+    FETCH BACKWARD ALL FROM c;
+ a | b | c | d  
+---+---+---+----
+ 5 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 1 | 1 | 1 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- partial delete
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(5 rows)
+
+    FETCH BACKWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(5 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 06c4c3e476..e64e20a8cb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -79,6 +79,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexskipscan           | on
  enable_material                | on
  enable_mergejoin               | on
  enable_nestloop                | on
@@ -90,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(18 rows)
+(19 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index 33102744eb..0227c98823 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -135,3 +135,251 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
 SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
 SELECT 2 IS NOT DISTINCT FROM null as "no";
 SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+    SELECT five, tenthous, 10 FROM
+    generate_series(1, 5) five,
+    generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+	VALUES (1, 1, 1),
+		   (1, 1, 2),
+		   (1, 2, 2),
+		   (1, 2, 3),
+		   (2, 2, 1),
+		   (2, 2, 3),
+		   (3, 1, 1),
+		   (3, 1, 2),
+		   (3, 2, 2),
+		   (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+    SELECT a, b::int2 b, (b % 2)::int2 c FROM
+        generate_series(1, 5) a,
+        generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
+
+-- test tuple killing
+
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+    FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- regular ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a, b;
+    FETCH FORWARD ALL FROM c;
+    FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- partial delete
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+    FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
-- 
2.27.0

v01-0003-Support-skip-scan-for-non-distinct-scans.patch
From 0fb9764c651ec2b5ce19147000de18877bf2dd3b Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Thu, 19 Mar 2020 10:27:47 +0100
Subject: [PATCH 5/5] Support skip scan for non-distinct scans

Adds planner support to choose a skip scan for regular
non-distinct queries like:
SELECT * FROM t1 WHERE b=1 (with index on (a,b))
---
 src/backend/optimizer/path/indxpath.c | 181 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  |   2 +-
 src/backend/optimizer/util/pathnode.c |   4 +-
 src/backend/utils/adt/selfuncs.c      | 153 ++++++++++++++++++++--
 src/include/optimizer/pathnode.h      |   3 +-
 5 files changed, 327 insertions(+), 16 deletions(-)

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 216dd4611e..9bf7bd73f1 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -188,6 +188,17 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
 static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 									   EquivalenceClass *ec, EquivalenceMember *em,
 									   void *arg);
+static List* add_possible_index_skip_paths(List* result,
+										  PlannerInfo *root,
+										  IndexOptInfo *index,
+										  List *indexclauses,
+										  List *indexorderbys,
+										  List *indexorderbycols,
+										  List *pathkeys,
+										  ScanDirection indexscandir,
+										  bool indexonly,
+										  Relids required_outer,
+										  double loop_count);
 
 
 /*
@@ -816,6 +827,136 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	}
 }
 
+/*
+ * Find available index skip paths and add them to the path list
+ */
+static List* add_possible_index_skip_paths(List* result,
+										  PlannerInfo *root,
+										  IndexOptInfo *index,
+										  List *indexclauses,
+										  List *indexorderbys,
+										  List *indexorderbycols,
+										  List *pathkeys,
+										  ScanDirection indexscandir,
+										  bool indexonly,
+										  Relids required_outer,
+										  double loop_count)
+{
+	int			indexcol;
+	bool		eqQualHere;
+	bool		eqQualPrev;
+	bool		eqSoFar;
+	ListCell   *lc;
+
+	/*
+	 * We need to find possible prefixes to use for the skip scan.
+	 * Any useful prefix is one just before an index clause, unless
+	 * all clauses so far have been equality clauses.
+	 * For example, on an index (a,b,c), the qual b=1 would
+	 * mean that an interesting skip prefix could be 1.
+	 * For qual a=1 AND b=1, it is not interesting to skip with
+	 * prefix 1, because the value of a is fixed already.
+	 */
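+	/*
+	 * (Added illustration, not in the original patch comment: with an
+	 * index on (a,b,c) and a single qual c=5, the first clause seen is
+	 * on column 2 with no equality quals before it, so a skip path with
+	 * prefix 2 is considered, i.e. skipping between distinct (a,b)
+	 * combinations.)
+	 */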
+	indexcol = 0;
+	eqQualHere = false;
+	eqQualPrev = false;
+	eqSoFar = true;
+	foreach(lc, indexclauses)
+	{
+		IndexClause *iclause = lfirst_node(IndexClause, lc);
+		ListCell   *lc2;
+
+		if (indexcol != iclause->indexcol)
+		{
+			if (!eqQualHere)
+				eqSoFar = false;
+
+			/* Beginning of a new column's quals */
+			if (!eqQualPrev && !eqSoFar)
+			{
+				/* We have a qual on current column,
+				 * there is no equality qual on the previous column,
+				 * not all of the previous quals are equality so far
+				 * (last one is special case for the first column in the index).
+				 * Optimal conditions to try an index skip path.
+				 */
+				IndexPath *ipath = create_index_path(root, index,
+										  indexclauses,
+										  indexorderbys,
+										  indexorderbycols,
+										  pathkeys,
+										  indexscandir,
+										  indexonly,
+										  required_outer,
+										  loop_count,
+										  false,
+										  iclause->indexcol);
+				result = lappend(result, ipath);
+			}
+
+			eqQualPrev = eqQualHere;
+			eqQualHere = false;
+			indexcol++;
+			/* if the clause is not for this index col, increment until it is */
+			while (indexcol != iclause->indexcol)
+			{
+				eqQualPrev = false;
+				eqSoFar = false;
+				indexcol++;
+			}
+		}
+
+		/* Examine each indexqual associated with this index clause */
+		foreach(lc2, iclause->indexquals)
+		{
+			RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc2);
+			Expr	   *clause = rinfo->clause;
+			Oid			clause_op = InvalidOid;
+			int			op_strategy;
+
+			if (IsA(clause, OpExpr))
+			{
+				OpExpr	   *op = (OpExpr *) clause;
+				clause_op = op->opno;
+			}
+			else if (IsA(clause, RowCompareExpr))
+			{
+				RowCompareExpr *rc = (RowCompareExpr *) clause;
+				clause_op = linitial_oid(rc->opnos);
+			}
+			else if (IsA(clause, ScalarArrayOpExpr))
+			{
+				ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
+				clause_op = saop->opno;
+			}
+			else if (IsA(clause, NullTest))
+			{
+				NullTest   *nt = (NullTest *) clause;
+
+				if (nt->nulltesttype == IS_NULL)
+				{
+					/* IS NULL is like = for selectivity purposes */
+					eqQualHere = true;
+				}
+			}
+			else
+				elog(ERROR, "unsupported indexqual type: %d",
+					 (int) nodeTag(clause));
+
+			/* check for equality operator */
+			if (OidIsValid(clause_op))
+			{
+				op_strategy = get_op_opfamily_strategy(clause_op,
+													   index->opfamily[indexcol]);
+				Assert(op_strategy != 0);	/* not a member of opfamily?? */
+				if (op_strategy == BTEqualStrategyNumber)
+					eqQualHere = true;
+			}
+		}
+	}
+	return result;
+}
+
 /*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
@@ -1051,9 +1192,25 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  index_only_scan,
 								  outer_relids,
 								  loop_count,
-								  false);
+								  false,
+								  0);
 		result = lappend(result, ipath);
 
+		if (can_skip)
+		{
+			result = add_possible_index_skip_paths(result, root, index,
+												   index_clauses,
+												   orderbyclauses,
+												   orderbyclausecols,
+												   useful_pathkeys,
+												   index_is_ordered ?
+												   ForwardScanDirection :
+												   NoMovementScanDirection,
+												   index_only_scan,
+												   outer_relids,
+												   loop_count);
+		}
+
 		/* Consider index skip scan as well */
 		if (root->query_uniquekeys != NULL && can_skip)
 		{
@@ -1100,7 +1257,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  index_only_scan,
 									  outer_relids,
 									  loop_count,
-									  true);
+									  true,
+									  0);
 
 			/*
 			 * if, after costing the path, we find that it's not worth using
@@ -1133,9 +1291,23 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  index_only_scan,
 									  outer_relids,
 									  loop_count,
-									  false);
+									  false,
+									  0);
 			result = lappend(result, ipath);
 
+			if (can_skip)
+			{
+				result = add_possible_index_skip_paths(result, root, index,
+													   index_clauses,
+													   NIL,
+													   NIL,
+													   useful_pathkeys,
+													   BackwardScanDirection,
+													   index_only_scan,
+													   outer_relids,
+													   loop_count);
+			}
+
 			/* Consider index skip scan as well */
 			if (root->query_uniquekeys != NULL && can_skip)
 			{
@@ -1177,7 +1349,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 										  index_only_scan,
 										  outer_relids,
 										  loop_count,
-										  true);
+										  true,
+										  0);
 
 				/*
 				 * if, after costing the path, we find that it's not worth
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6a5259926d..da2bdcd8d3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6356,7 +6356,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0, false);
+									  NULL, 1.0, false, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 79af18e7a7..11d3b71ea5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  bool indexonly,
 				  Relids required_outer,
 				  double loop_count,
-				  bool partial_path)
+				  bool partial_path,
+				  int skip_prefix)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1039,6 +1040,7 @@ create_index_path(PlannerInfo *root,
 	pathnode->indexorderbys = indexorderbys;
 	pathnode->indexorderbycols = indexorderbycols;
 	pathnode->indexscandir = indexscandir;
+	pathnode->indexskipprefix = skip_prefix;
 
 	cost_index(pathnode, root, loop_count, partial_path);
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index be08eb4814..24ea521237 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -210,7 +210,9 @@ static bool get_actual_variable_endpoint(Relation heapRel,
 										 MemoryContext outercontext,
 										 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
-
+static double estimate_num_groups_internal(PlannerInfo *root, List *groupExprs,
+									double input_rows, double rel_input_rows,
+									List **pgset);
 
 /*
  *		eqsel			- Selectivity of "=" for any data types.
@@ -3356,6 +3358,19 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 double
 estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 					List **pgset)
+{
+	return estimate_num_groups_internal(root, groupExprs, input_rows, -1, pgset);
+}
+
+/*
+ * Same as estimate_num_groups, but with an extra argument to control
+ * the estimation used for the input rows of the relation. If
+ * rel_input_rows < 0, it uses the original planner estimation for the
+ * individual rels; otherwise it uses the estimation provided to the function.
+ */
+static double
+estimate_num_groups_internal(PlannerInfo *root, List *groupExprs, double input_rows, double rel_input_rows,
+					List **pgset)
 {
 	List	   *varinfos = NIL;
 	double		srf_multiplier = 1.0;
@@ -3510,6 +3525,12 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		int			relvarcount = 0;
 		List	   *newvarinfos = NIL;
 		List	   *relvarinfos = NIL;
+		double this_rel_input_rows;
+
+		if (rel_input_rows < 0.0)
+			this_rel_input_rows = rel->rows;
+		else
+			this_rel_input_rows = rel_input_rows;
 
 		/*
 		 * Split the list of varinfos in two - one for the current rel, one
@@ -3607,7 +3628,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			 * guarding against division by zero when reldistinct is zero.
 			 * Also skip this if we know that we are returning all rows.
 			 */
-			if (reldistinct > 0 && rel->rows < rel->tuples)
+			if (reldistinct > 0 && this_rel_input_rows < rel->tuples)
 			{
 				/*
 				 * Given a table containing N rows with n distinct values in a
@@ -3644,7 +3665,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				 * works well even when n is small.
 				 */
 				reldistinct *=
-					(1 - pow((rel->tuples - rel->rows) / rel->tuples,
+					(1 - pow((rel->tuples - this_rel_input_rows) / rel->tuples,
 							 rel->tuples / reldistinct));
 			}
 			reldistinct = clamp_row_est(reldistinct);
@@ -6244,8 +6265,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	double		numIndexTuples;
 	Cost		descentCost;
 	List	   *indexBoundQuals;
+	List	   *prefixBoundQuals;
 	int			indexcol;
 	bool		eqQualHere;
+	bool		stillEq;
 	bool		found_saop;
 	bool		found_is_null_op;
 	double		num_sa_scans;
@@ -6269,9 +6292,11 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * considered to act the same as it normally does.
 	 */
 	indexBoundQuals = NIL;
+	prefixBoundQuals = NIL;
 	indexcol = 0;
 	eqQualHere = false;
 	found_saop = false;
+	stillEq = true;
 	found_is_null_op = false;
 	num_sa_scans = 1;
 	foreach(lc, path->indexclauses)
@@ -6283,11 +6308,18 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 		{
 			/* Beginning of a new column's quals */
 			if (!eqQualHere)
-				break;			/* done if no '=' qual for indexcol */
+			{
+				stillEq = false;
+				/* done if no '=' qual for indexcol and we're past the skip prefix */
+				if (path->indexskipprefix <= indexcol)
+					break;
+			}
 			eqQualHere = false;
 			indexcol++;
+			while (indexcol != iclause->indexcol && path->indexskipprefix > indexcol)
+				indexcol++;
 			if (indexcol != iclause->indexcol)
-				break;			/* no quals at all for indexcol */
+				break; /* no quals at all for indexcol */
 		}
 
 		/* Examine each indexqual associated with this index clause */
@@ -6319,7 +6351,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				clause_op = saop->opno;
 				found_saop = true;
 				/* count number of SA scans induced by indexBoundQuals only */
-				if (alength > 1)
+				if (alength > 1 && stillEq)
 					num_sa_scans *= alength;
 			}
 			else if (IsA(clause, NullTest))
@@ -6347,7 +6379,14 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 					eqQualHere = true;
 			}
 
-			indexBoundQuals = lappend(indexBoundQuals, rinfo);
+			/* We keep two lists here: one with all quals up to the prefix,
+			 * and one with only the quals up to the first inequality.
+			 * We need the prefix list later.
+			 */
+			if (stillEq)
+				indexBoundQuals = lappend(indexBoundQuals, rinfo);
+			if (path->indexskipprefix > 0)
+				prefixBoundQuals = lappend(prefixBoundQuals, rinfo);
 		}
 	}
 
@@ -6373,7 +6412,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 		 * index-bound quals to produce a more accurate idea of the number of
 		 * rows covered by the bound conditions.
 		 */
-		selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+		if (path->indexskipprefix > 0)
+			selectivityQuals = add_predicate_to_index_quals(index, prefixBoundQuals);
+		else
+			selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
 
 		btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
 												  index->rel->relid,
@@ -6383,7 +6425,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 
 		/*
 		 * As in genericcostestimate(), we have to adjust for any
-		 * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
+		 * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
 		 * to integer.
 		 */
 		numIndexTuples = rint(numIndexTuples / num_sa_scans);
@@ -6429,6 +6471,99 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	costs.indexStartupCost += descentCost;
 	costs.indexTotalCost += costs.num_sa_scans * descentCost;
 
+	/*
+	 * Add extra costs for using an index skip scan.
+	 * Up to this point the index skip scan could have a significantly lower
+	 * cost, due to the different row estimation used (all the quals up to
+	 * the prefix, rather than all the quals up to the first non-equality
+	 * operator). However, there are extra costs incurred for
+	 * a) setting up the scan
+	 * b) doing additional scans from the root
+	 * c) a small extra cost per tuple comparison
+	 * We add those here.
+	 */
+	if (path->indexskipprefix > 0)
+	{
+		List *exprlist = NULL;
+		double numgroups_estimate;
+		int i = 0;
+		ListCell *indexpr_item = list_head(path->indexinfo->indexprs);
+		List	   *selectivityQuals;
+		Selectivity btreeSelectivity;
+		double estimatedIndexTuplesNoPrefix;
+
+		/* some rather arbitrary extra cost for preprocessing structures needed for skip scan */
+		costs.indexStartupCost += 200.0 * cpu_operator_cost;
+		costs.indexTotalCost += 200.0 * cpu_operator_cost;
+
+		/*
+		 * In order to reliably estimate the cost of the scans we have to do from the
+		 * root, we need an estimate of the number of distinct prefixes that exist.
+		 * Therefore, we need a different selectivity approximation (this time using the
+		 * clauses up to the first non-equality operator) to estimate the number of groups.
+		 */
+		for (i = 0; i < path->indexinfo->nkeycolumns && i < path->indexskipprefix; i++)
+		{
+			Expr *expr = NULL;
+			int attr = path->indexinfo->indexkeys[i];
+			if(attr > 0)
+			{
+				TargetEntry *tentry = get_tle_by_resno(path->indexinfo->indextlist, i + 1);
+				Assert(tentry != NULL);
+				expr = tentry->expr;
+			}
+			else if (attr == 0)
+			{
+				/* Expression index */
+				expr = lfirst(indexpr_item);
+				indexpr_item = lnext(path->indexinfo->indexprs, indexpr_item);
+			}
+			else /* attr < 0 */
+			{
+				/* Index on system column is not supported */
+				Assert(false);
+			}
+
+			exprlist = lappend(exprlist, expr);
+		}
+
+		selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+
+		btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
+												  index->rel->relid,
+												  JOIN_INNER,
+												  NULL);
+		estimatedIndexTuplesNoPrefix = btreeSelectivity * index->rel->tuples;
+
+		/*
+		 * As in genericcostestimate(), we have to adjust for any
+		 * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
+		 * to integer.
+		 */
+		estimatedIndexTuplesNoPrefix = rint(estimatedIndexTuplesNoPrefix / num_sa_scans);
+
+		numgroups_estimate = estimate_num_groups_internal(
+					root, exprlist, estimatedIndexTuplesNoPrefix,
+					estimatedIndexTuplesNoPrefix, NULL);
+
+		/*
+		 * For each distinct prefix value we add a descent cost.
+		 * This is similar to the startup cost calculation for regular scans.
+		 * We can do at most 2 scans from root per distinct prefix, so multiply by 2.
+		 * Also add some CPU processing cost per page that we need to process, plus
+		 * some additional one-time cost for scanning the leaf page. This is a more
+		 * expensive estimation than the per-page cpu cost for the regular index scan.
+		 * This is intentional, because the index skip scan does more processing on
+		 * the leaf page.
+		 */
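+		/*
+		 * (Added worked example, not from the original patch: with
+		 * num_sa_scans = 1, an estimated 100 distinct prefixes, 1e6 index
+		 * tuples and tree_height = 2, the lines below add roughly
+		 * (2 * ceil(log2(1e6)) + 3 * 100 + 200) * 100 = 54000 * cpu_operator_cost
+		 * to the total index cost.)
+		 */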
+		if (index->tuples > 0)
+			descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost * 2;
+		else
+			descentCost = 0;
+		descentCost += (index->tree_height + 1) * 50.0 * cpu_operator_cost * 2 + 200 * cpu_operator_cost;
+		costs.indexTotalCost += costs.num_sa_scans * descentCost * numgroups_estimate;
+	}
+
 	/*
 	 * If we can get an estimate of the first column's ordering correlation C
 	 * from pg_statistic, estimate the index correlation as C for a
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 8ec1780a56..4fa9adbfee 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,7 +49,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 									bool indexonly,
 									Relids required_outer,
 									double loop_count,
-									bool partial_path);
+									bool partial_path,
+									int skip_prefix);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 											   RelOptInfo *rel,
 											   Path *bitmapqual,
-- 
2.27.0
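
To make the target of v01-0003 concrete, here is a minimal sketch of the kind
of query it lets the planner consider a skip scan for. The table, data and
comments are illustrative only (names are made up; no plan output from a
patched server is shown):

CREATE TABLE t1 (a int, b int);
CREATE INDEX ON t1 (a, b);
INSERT INTO t1
    SELECT a, b FROM generate_series(1, 10) a,
                     generate_series(1, 100000) b;
ANALYZE t1;

-- No qual on the leading column "a": a regular index scan has to walk the
-- whole index, while a skip scan can jump to each of the ten distinct
-- values of "a" and look up b = 1 within each group.
EXPLAIN SELECT * FROM t1 WHERE b = 1;

With the cost model above, the skip scan should only win here because the
estimated number of distinct prefixes (ten values of "a") is small relative
to the size of the index.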

#2David Rowley
dgrowleyml@gmail.com
In reply to: Floris Van Nee (#1)
Re: Generic Index Skip Scan

On Thu, 16 Jul 2020 at 07:52, Floris Van Nee <florisvannee@optiver.com> wrote:

Besides the great efforts that Dmitry et al. are putting into the skip scan for DISTINCT queries [1], I’m also still keen on extending the use of it further. I’d like to address the limited cases in which skipping can occur here. A few months ago I shared an initial rough patch that provided a generic skip implementation, but lacked the proper planning work [2]. I’d like to share a second patch set that provides an implementation of the planner as well. Perhaps this can lead to some proper discussions how we’d like to shape this patch further.

Please see [2] for an introduction and some rough performance comparisons. This patch improves upon those, because it implements proper cost estimation logic. It will now only choose the skip scan if it’s deemed to be cheaper than using a regular index scan. Other than that, all the features are still there. The skip scan can be used in many more types of queries than in the original DISTINCT patch as provided in [1], making it more performant and also more predictable for users.

I’m keen on receiving feedback on this idea and on the patch.

I don't think anyone ever thought the feature would be limited to just
making DISTINCT go faster. There are certain to be more uses in the
future.

However, for me it would be premature to look at this now.

David

#3Floris Van Nee
florisvannee@Optiver.com
In reply to: Floris Van Nee (#1)
5 attachment(s)
RE: Generic Index Skip Scan

Attached v02, rebased on latest master. It uses the new nbtree lock/unlock functions by Peter, and I've verified with Valgrind that there are no cases where it tries to access pages without holding the lock. Just for debugging, I ran the Valgrind session on a modified version of the patch that always favors a skip scan over a regular index scan, in order to greatly increase the Valgrind coverage of the new parts.

-Floris

Attachments:

v9-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
From 64521debc928b10c48c32a34a4fe4fc10f318fdd Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Sun, 3 May 2020 22:37:46 +0800
Subject: [PATCH 1/5] Introduce RelOptInfo->notnullattrs attribute

The notnullattrs is calculated from catalog and run-time query. That
information is translated to the child relations as well for partitioned
tables.
---
 src/backend/optimizer/path/allpaths.c  | 31 ++++++++++++++++++++++++++
 src/backend/optimizer/plan/initsplan.c | 10 +++++++++
 src/backend/optimizer/util/plancat.c   | 10 +++++++++
 src/include/nodes/pathnodes.h          |  2 ++
 4 files changed, 53 insertions(+)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 6da0dcd61c..484dab0a1a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1005,6 +1005,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		RelOptInfo *childrel;
 		ListCell   *parentvars;
 		ListCell   *childvars;
+		int i = -1;
 
 		/* append_rel_list contains all append rels; ignore others */
 		if (appinfo->parent_relid != parentRTindex)
@@ -1061,6 +1062,36 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 								   (Node *) rel->reltarget->exprs,
 								   1, &appinfo);
 
+		/* Copy notnullattrs. */
+		while ((i = bms_next_member(rel->notnullattrs, i)) > 0)
+		{
+			AttrNumber attno = i + FirstLowInvalidHeapAttributeNumber;
+			AttrNumber child_attno;
+			if (attno == 0)
+			{
+				/* Whole row is not null, so must be same for child */
+				childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+														attno - FirstLowInvalidHeapAttributeNumber);
+				break;
+			}
+			if (attno < 0 )
+				/* no need to translate system column */
+				child_attno = attno;
+			else
+			{
+				Node * node = list_nth(appinfo->translated_vars, attno - 1);
+				if (!IsA(node, Var))
+					/* This may happen in the UNION case, like (SELECT a FROM t1 UNION SELECT a + 3
+					 * FROM t2) t, where we know t.a is not null
+					 */
+					continue;
+				child_attno = castNode(Var, node)->varattno;
+			}
+
+			childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+													child_attno - FirstLowInvalidHeapAttributeNumber);
+		}
+
 		/*
 		 * We have to make child entries in the EquivalenceClass data
 		 * structures as well.  This is needed either if the parent
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index e978b491f6..95b1b14cd3 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -830,6 +830,16 @@ deconstruct_recurse(PlannerInfo *root, Node *jtnode, bool below_outer_join,
 		{
 			Node	   *qual = (Node *) lfirst(l);
 
+			/* Set the not null info now */
+			ListCell	*lc;
+			List		*non_nullable_vars = find_nonnullable_vars(qual);
+			foreach(lc, non_nullable_vars)
+			{
+				Var *var = lfirst_node(Var, lc);
+				RelOptInfo *rel = root->simple_rel_array[var->varno];
+				rel->notnullattrs = bms_add_member(rel->notnullattrs,
+												   var->varattno - FirstLowInvalidHeapAttributeNumber);
+			}
 			distribute_qual_to_rels(root, qual,
 									false, below_outer_join, JOIN_INNER,
 									root->qual_security_level,
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 25545029d7..0b2f9d398a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -117,6 +117,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	Relation	relation;
 	bool		hasindex;
 	List	   *indexinfos = NIL;
+	int			i;
 
 	/*
 	 * We need not lock the relation since it was already locked, either by
@@ -463,6 +464,15 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 	if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
 		set_relation_partition_info(root, rel, relation);
 
+	Assert(rel->notnullattrs == NULL);
+	for(i = 0; i < relation->rd_att->natts; i++)
+	{
+		FormData_pg_attribute attr = relation->rd_att->attrs[i];
+		if (attr.attnotnull)
+			rel->notnullattrs = bms_add_member(rel->notnullattrs,
+											   attr.attnum - FirstLowInvalidHeapAttributeNumber);
+	}
+
 	table_close(relation, NoLock);
 
 	/*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 485d1b06c9..9e3ebd488a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -709,6 +709,8 @@ typedef struct RelOptInfo
 	PlannerInfo *subroot;		/* if subquery */
 	List	   *subplan_params; /* if subquery */
 	int			rel_parallel_workers;	/* wanted number of parallel workers */
+	/* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */
+	Bitmapset		*notnullattrs;
 
 	/* Information about foreign tables and foreign joins */
 	Oid			serverid;		/* identifies server for the table or join */
-- 
2.27.0

v9-0002-Introduce-UniqueKey-attributes-on-RelOptInfo-stru.patch
From 73649873fb5187fc0c291fb2cf34b4a12c95fa8a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Mon, 11 May 2020 15:50:52 +0800
Subject: [PATCH 2/5] Introduce UniqueKey attributes on RelOptInfo struct.

UniqueKey is a set of exprs on a RelOptInfo which represents exprs that
will be unique on the given RelOptInfo. See README.uniquekey
for more information.
---
 src/backend/nodes/copyfuncs.c               |   13 +
 src/backend/nodes/list.c                    |   31 +
 src/backend/nodes/makefuncs.c               |   13 +
 src/backend/nodes/outfuncs.c                |   11 +
 src/backend/nodes/readfuncs.c               |   10 +
 src/backend/optimizer/path/Makefile         |    3 +-
 src/backend/optimizer/path/README.uniquekey |  131 +++
 src/backend/optimizer/path/allpaths.c       |   10 +
 src/backend/optimizer/path/joinpath.c       |    9 +-
 src/backend/optimizer/path/joinrels.c       |    2 +
 src/backend/optimizer/path/pathkeys.c       |    3 +-
 src/backend/optimizer/path/uniquekeys.c     | 1131 +++++++++++++++++++
 src/backend/optimizer/plan/planner.c        |   13 +-
 src/backend/optimizer/prep/prepunion.c      |    2 +
 src/backend/optimizer/util/appendinfo.c     |   44 +
 src/backend/optimizer/util/inherit.c        |   16 +-
 src/include/nodes/makefuncs.h               |    3 +
 src/include/nodes/nodes.h                   |    1 +
 src/include/nodes/pathnodes.h               |   29 +-
 src/include/nodes/pg_list.h                 |    2 +
 src/include/optimizer/appendinfo.h          |    3 +
 src/include/optimizer/optimizer.h           |    2 +
 src/include/optimizer/paths.h               |   43 +
 23 files changed, 1502 insertions(+), 23 deletions(-)
 create mode 100644 src/backend/optimizer/path/README.uniquekey
 create mode 100644 src/backend/optimizer/path/uniquekeys.c

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 89c409de66..1f50400fd2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2273,6 +2273,16 @@ _copyPathKey(const PathKey *from)
 	return newnode;
 }
 
+static UniqueKey *
+_copyUniqueKey(const UniqueKey *from)
+{
+	UniqueKey	*newnode = makeNode(UniqueKey);
+
+	COPY_NODE_FIELD(exprs);
+	COPY_SCALAR_FIELD(multi_nullvals);
+
+	return newnode;
+}
 /*
  * _copyRestrictInfo
  */
@@ -5152,6 +5162,9 @@ copyObjectImpl(const void *from)
 		case T_PathKey:
 			retval = _copyPathKey(from);
 			break;
+		case T_UniqueKey:
+			retval = _copyUniqueKey(from);
+			break;
 		case T_RestrictInfo:
 			retval = _copyRestrictInfo(from);
 			break;
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 80fa8c84e4..a7a99b70f2 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -687,6 +687,37 @@ list_member_oid(const List *list, Oid datum)
 	return false;
 }
 
+/*
+ * return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+	const ListCell	*lc1, *lc2;
+
+	Assert(IsPointerList(members));
+	Assert(IsPointerList(target));
+	check_list_invariants(members);
+	check_list_invariants(target);
+
+	foreach(lc1, members)
+	{
+		bool found = false;
+		foreach(lc2, target)
+		{
+			if (equal(lfirst(lc1), lfirst(lc2)))
+			{
+				found = true;
+				break;
+			}
+		}
+		if (!found)
+			return false;
+	}
+	return true;
+}
+
 /*
  * Delete the n'th cell (counting from 0) in list.
  *
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 49de285f01..646cf7c9a1 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -814,3 +814,16 @@ makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
 	v->va_cols = va_cols;
 	return v;
 }
+
+
+/*
+ * makeUniqueKey
+ */
+UniqueKey*
+makeUniqueKey(List *exprs, bool multi_nullvals)
+{
+	UniqueKey * ukey = makeNode(UniqueKey);
+	ukey->exprs = exprs;
+	ukey->multi_nullvals = multi_nullvals;
+	return ukey;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e2f177515d..c3a9632992 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2428,6 +2428,14 @@ _outPathKey(StringInfo str, const PathKey *node)
 	WRITE_BOOL_FIELD(pk_nulls_first);
 }
 
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+	WRITE_NODE_TYPE("UNIQUEKEY");
+	WRITE_NODE_FIELD(exprs);
+	WRITE_BOOL_FIELD(multi_nullvals);
+}
+
 static void
 _outPathTarget(StringInfo str, const PathTarget *node)
 {
@@ -4127,6 +4135,9 @@ outNode(StringInfo str, const void *obj)
 			case T_PathKey:
 				_outPathKey(str, obj);
 				break;
+			case T_UniqueKey:
+				_outUniqueKey(str, obj);
+				break;
 			case T_PathTarget:
 				_outPathTarget(str, obj);
 				break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 42050ab719..3a18571d0c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -452,6 +452,14 @@ _readSetOperationStmt(void)
 	READ_DONE();
 }
 
+static UniqueKey *
+_readUniqueKey(void)
+{
+	READ_LOCALS(UniqueKey);
+	READ_NODE_FIELD(exprs);
+	READ_BOOL_FIELD(multi_nullvals);
+	READ_DONE();
+}
 
 /*
  *	Stuff from primnodes.h.
@@ -2656,6 +2664,8 @@ parseNodeString(void)
 		return_value = _readCommonTableExpr();
 	else if (MATCH("SETOPERATIONSTMT", 16))
 		return_value = _readSetOperationStmt();
+	else if (MATCH("UNIQUEKEY", 9))
+		return_value = _readUniqueKey();
 	else if (MATCH("ALIAS", 5))
 		return_value = _readAlias();
 	else if (MATCH("RANGEVAR", 8))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	joinpath.o \
 	joinrels.o \
 	pathkeys.o \
-	tidpath.o
+	tidpath.o \
+	uniquekeys.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 0000000000..5eac761995
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,131 @@
+1. What is UniqueKey?
+We can think of a UniqueKey as a set of exprs for a RelOptInfo which we are
+sure doesn't yield the same result among all the rows. The simplest form of
+UniqueKey is a primary key.
+
+We define the UniqueKey as below.
+
+typedef struct UniqueKey
+{
+        NodeTag	type;
+        List	*exprs;
+        bool	multi_nullvals;
+} UniqueKey;
+
+exprs is a list of exprs which is unique on the current RelOptInfo. exprs =
+NIL is a special case of UniqueKey, which means there is only one row in that
+relation. It has stronger semantics than the others: in SELECT uk FROM t, uk
+is a normal unique key and may have different values, while in SELECT colx
+FROM t WHERE uk = const, colx is unique AND we have only 1 value. This field
+can be used for innerrel_is_unique. This logic is handled specially in the
+add_uniquekey_for_onerow function.
+
+multi_nullvals: true means multiple null values may exist in these exprs, so
+uniqueness is not guaranteed in this case. This field is necessary for
+remove_useless_joins & reduce_unique_semijoins, where we don't mind these
+duplicated NULL values. It is set to true in 2 cases: one is a unique key
+from a unique index whose related column is nullable; the other is for
+outer joins. See populate_joinrel_uniquekeys for details.
+
+
+The UniqueKey can be used in at least the following cases:
+1. remove_useless_joins.
+2. reduce_unique_semijoins.
+3. remove the Distinct node if the distinct clause is unique.
+4. remove the Agg node if the group by clause is unique.
+5. Index Skip Scan (WIP).
+6. Aggregation push down without 2-phase aggregation if the join can't
+   duplicate the aggregated rows (work-in-progress feature).
+
+2. How is it maintained?
+
+We have a set of populate_xxx_uniquekeys functions to maintain the UniqueKey
+in various cases. xxx includes baserel, joinrel, partitionedrel, distinctrel,
+groupedrel, unionrel. We also need to convert the UniqueKey from a subquery
+to the outer relation, which is what convert_subquery_uniquekeys does.
+
+1. The first part is about baserel. We handle 3 cases; suppose we have a
+unique index on (a, b).
+
+1. SELECT a, b FROM t.  UniqueKey (a, b)
+2. SELECT a FROM t WHERE b = 1;  UniqueKey (a)
+3. SELECT .. FROM t WHERE a = 1 AND b = 1;  UniqueKey (NIL).  onerow case, every
+   column is Unique.
+
+2. The next part is joinrel. This part is the most error-prone; we simplified
+the rules as below:
+1. If the relation's UniqueKey can't be duplicated after the join, then it
+   will still be valid for the join rel. The function we use here is
+   innerrel_keeps_unique. The basic idea is innerrel.any_col = outer.uk.
+
+2. If the UniqueKey can't stay valid via rule 1, the combination of the
+   UniqueKeys from both sides is valid for sure. We can prove this as
+   follows: if the unique exprs from rel1 are duplicated by rel2, the
+   duplicated rows must contain different unique exprs from rel2.
+
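+(An added illustration, not part of the original README: for rule 1,
+consider SELECT * FROM t1 JOIN t2 ON t2.pk = t1.x. Since t2.pk is unique,
+each t1 row matches at most one t2 row, so t1's UniqueKeys can't be
+duplicated by the join and stay valid on the join rel.)
+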
+More considerations about onerow:
+1. If a relation has one row and it can't be duplicated, it may still
+   contain multi_nullvals after an outer join.
+2. If either UniqueKey can be duplicated after the join, we can get one row
+   only when both sides are one row AND there is no outer join.
+3. Whenever the onerow UniqueKey is no longer valid, we need to convert the
+   one-row UniqueKey to a normal unique key, since we don't store exprs for
+   a one-row relation. get_exprs_from_uniquekeys will be used here.
+
+
+More considerations about multi_nullvals after a join:
+1. If the original UniqueKey has multi_nullvals, the final UniqueKey will have
+   multi_nullvals in any case.
+2. Even if a UniqueKey doesn't allow multi_nullvals, it may start allowing
+   them after an outer join, since the outer join NULL-extends one side (see
+   the example below).
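+
+For example, with t1.pk and t2.pk both primary keys:
+
+SELECT t1.pk, t2.pk FROM t1 LEFT JOIN t2 ON t1.pk = t2.b;
+
+(t2.pk) remains a UniqueKey of the join (each t2 row joins at most one t1 row
+because t1.pk is unique), but every unmatched t1 row yields NULL for t2.pk, so
+multi_nullvals is flipped to true.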
+
+
+3. When we come to a subquery, we need convert_subquery_uniquekeys, just like
+convert_subquery_pathkeys.  Only a UniqueKey inside the subquery that is
+referenced as a Var in the outer relation will be reused. The relationship
+between the outerrel.Var and the subquery.exprs is built with
+outerrel->subroot->processed_tlist.
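+
+For example:
+
+SELECT sub.uk FROM (SELECT uk, other FROM t) sub;
+
+The UniqueKey (uk) inside the subquery is referenced as sub.uk above, so it is
+converted to the UniqueKey (sub.uk) on the outer relation.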
+
+
+4. As for SRFs, they break the uniqueness of a UniqueKey. However, this is
+handled in adjust_paths_for_srfs, which happens after query_planner, so we
+maintain the UniqueKey until there and reset it to NIL at that place. This
+can't help the distinct/group by elimination cases, but it probably helps in
+some other cases, like reduce_unique_semijoins/remove_useless_joins, and it is
+semantically correct. See the example below.
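+
+For example:
+
+SELECT uk, generate_series(1, 2) FROM t;
+
+Each input row is emitted twice, so uk is no longer unique in the output;
+adjust_paths_for_srfs therefore resets the rel's uniquekeys to NIL.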
+
+
+5. As for inheritance tables, we first maintain the UniqueKey on the childrels
+as usual. But for a partitioned table we need to maintain 2 different kinds of
+UniqueKey: 1). UniqueKey on the parent relation; 2). UniqueKey on the child
+relations for partition-wise queries.
+
+Example:
+CREATE TABLE p (a int not null, b int not null) partition by list (a);
+CREATE TABLE p0 partition of p for values in (1);
+CREATE TABLE p1 partition of p for values in (2);
+
+create unique index p0_b on p0(b);
+create unique index p1_b on p1(b);
+
+Now b is only unique at the partition level, so the DISTINCT can't be removed
+in the following case: SELECT DISTINCT b FROM p;
+
+Another example is SELECT DISTINCT a, b FROM p WHERE a = 1; Since only one
+partition is chosen, the UniqueKey on the child relation is the same as the
+UniqueKey on the parent relation.
+
+Another use of partition-level UniqueKeys is that they can be helpful for
+partition-wise join.
+
+As for the UniqueKey at the parent table level, it comes about in 2 different
+ways: 1). The UniqueKey is derived from a unique index, but the same index
+must exist in all the related child relations and the unique index must
+contain the partition key. Example:
+
+CREATE UNIQUE INDEX p_ab ON p(a, b);  -- where a is the partition key.
+
+-- Query
+SELECT a, b FROM p; the (a, b) is a UniqueKey of p.
+
+2). If the parent relation has only one childrel, the UniqueKey on the
+childrel is the UniqueKey on the parent as well.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 484dab0a1a..2ad9d06d7a 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -579,6 +579,12 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
 	 */
 	check_index_predicates(root, rel);
 
+	/*
+	 * Now that we've marked which partial indexes are suitable, we can now
+	 * build the relation's unique keys.
+	 */
+	populate_baserel_uniquekeys(root, rel, rel->indexlist);
+
 	/* Mark rel with estimated output rows, width, etc */
 	set_baserel_size_estimates(root, rel);
 }
@@ -1310,6 +1316,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
 
 	/* Add paths to the append relation. */
 	add_paths_to_append_rel(root, rel, live_childrels);
+	if (IS_PARTITIONED_REL(rel))
+		populate_partitionedrel_uniquekeys(root, rel, live_childrels);
 }
 
 
@@ -2383,6 +2391,8 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
 										  pathkeys, required_outer));
 	}
 
+	convert_subquery_uniquekeys(root, rel, sub_final_rel);
+
 	/* If outer rel allows parallelism, do same for partial paths. */
 	if (rel->consider_parallel && bms_is_empty(required_outer))
 	{
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index db54a6ba2e..ef0fd2fb0b 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -71,13 +71,6 @@ static void consider_parallel_mergejoin(PlannerInfo *root,
 static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
 								 RelOptInfo *outerrel, RelOptInfo *innerrel,
 								 JoinType jointype, JoinPathExtraData *extra);
-static List *select_mergejoin_clauses(PlannerInfo *root,
-									  RelOptInfo *joinrel,
-									  RelOptInfo *outerrel,
-									  RelOptInfo *innerrel,
-									  List *restrictlist,
-									  JoinType jointype,
-									  bool *mergejoin_allowed);
 static void generate_mergejoin_paths(PlannerInfo *root,
 									 RelOptInfo *joinrel,
 									 RelOptInfo *innerrel,
@@ -1927,7 +1920,7 @@ hash_inner_and_outer(PlannerInfo *root,
  * if it is mergejoinable and involves vars from the two sub-relations
  * currently of interest.
  */
-static List *
+List *
 select_mergejoin_clauses(PlannerInfo *root,
 						 RelOptInfo *joinrel,
 						 RelOptInfo *outerrel,
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2d343cd293..b9163ee8ff 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -924,6 +924,8 @@ populate_joinrel_with_paths(PlannerInfo *root, RelOptInfo *rel1,
 
 	/* Apply partitionwise join technique, if possible. */
 	try_partitionwise_join(root, rel1, rel2, joinrel, sjinfo, restrictlist);
+
+	populate_joinrel_uniquekeys(root, joinrel, rel1, rel2, restrictlist, sjinfo->jointype);
 }
 
 
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ce9bf87e9b..7e596d4194 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -33,7 +33,6 @@ static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 											 RelOptInfo *partrel,
 											 int partkeycol);
-static Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
 static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
 
 
@@ -1035,7 +1034,7 @@ convert_subquery_pathkeys(PlannerInfo *root, RelOptInfo *rel,
  * We need this to ensure that we don't return pathkeys describing values
  * that are unavailable above the level of the subquery scan.
  */
-static Var *
+Var *
 find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle)
 {
 	ListCell   *lc;
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
new file mode 100644
index 0000000000..b33bcd2f32
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -0,0 +1,1131 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekeys.c
+ *	  Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/optimizer/path/uniquekeys.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/appendinfo.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/tlist.h"
+#include "rewrite/rewriteManip.h"
+
+
+/*
+ * This struct is used to help populate_joinrel_uniquekeys.
+ *
+ * added_to_joinrel is true if a uniquekey (from outerrel or innerrel)
+ * has been added to joinrel.
+ * useful is true if the exprs of the uniquekey still appear in joinrel.
+ */
+typedef struct UniqueKeyContextData
+{
+	UniqueKey	*uniquekey;
+	bool	added_to_joinrel;
+	bool	useful;
+} *UniqueKeyContext;
+
+static List *initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel);
+static bool innerrel_keeps_unique(PlannerInfo *root,
+								  RelOptInfo *outerrel,
+								  RelOptInfo *innerrel,
+								  List *restrictlist,
+								  bool reverse);
+
+static List *get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+										List *const_exprs,
+										List *const_expr_opfamilies,
+										Bitmapset *used_varattrs,
+										bool *useful,
+										bool *multi_nullvals);
+static List *get_exprs_from_uniquekey(RelOptInfo *joinrel,
+									  RelOptInfo *rel1,
+									  UniqueKey *ukey);
+static void add_uniquekey_for_onerow(RelOptInfo *rel);
+static bool add_combined_uniquekey(RelOptInfo *joinrel,
+								   RelOptInfo *outer_rel,
+								   RelOptInfo *inner_rel,
+								   UniqueKey *outer_ukey,
+								   UniqueKey *inner_ukey,
+								   JoinType jointype);
+
+/* Used for unique indexes checking for partitioned table */
+static bool index_constains_partkey(RelOptInfo *partrel,  IndexOptInfo *ind);
+static IndexOptInfo *simple_copy_indexinfo_to_parent(PlannerInfo *root,
+													 RelOptInfo *parentrel,
+													 IndexOptInfo *from);
+static bool simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2);
+static void adjust_partition_unique_indexlist(PlannerInfo *root,
+											  RelOptInfo *parentrel,
+											  RelOptInfo *childrel,
+											  List **global_unique_index);
+
+/* Helper function for grouped relation and distinct relation. */
+static void add_uniquekey_from_sortgroups(PlannerInfo *root,
+										  RelOptInfo *rel,
+										  List *sortgroups);
+
+/*
+ * populate_baserel_uniquekeys
+ *		Populate the uniquekeys list of 'baserel' by looking at the rel's
+ *		unique indexes and baserestrictinfo
+ */
+void
+populate_baserel_uniquekeys(PlannerInfo *root,
+							RelOptInfo *baserel,
+							List *indexlist)
+{
+	ListCell *lc;
+	List	*matched_uniq_indexes = NIL;
+
+	/* Attrs appearing in rel->reltarget->exprs. */
+	Bitmapset *used_attrs = NULL;
+
+	List	*const_exprs = NIL;
+	List	*expr_opfamilies = NIL;
+
+	Assert(baserel->rtekind == RTE_RELATION);
+
+	foreach(lc, indexlist)
+	{
+		IndexOptInfo *ind = (IndexOptInfo *) lfirst(lc);
+		if (!ind->unique || !ind->immediate ||
+			(ind->indpred != NIL && !ind->predOK))
+			continue;
+		matched_uniq_indexes = lappend(matched_uniq_indexes, ind);
+	}
+
+	if (matched_uniq_indexes == NIL)
+		return;
+
+	/* Check which attrs are used in baserel->reltarget */
+	pull_varattnos((Node *)baserel->reltarget->exprs, baserel->relid, &used_attrs);
+
+	/* Check which attnos are used in a mergeable const filter */
+	foreach(lc, baserel->baserestrictinfo)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+		if (rinfo->mergeopfamilies == NIL)
+			continue;
+
+		if (bms_is_empty(rinfo->left_relids))
+		{
+			const_exprs = lappend(const_exprs, get_rightop(rinfo->clause));
+		}
+		else if (bms_is_empty(rinfo->right_relids))
+		{
+			const_exprs = lappend(const_exprs, get_leftop(rinfo->clause));
+		}
+		else
+			continue;
+
+		expr_opfamilies = lappend(expr_opfamilies, rinfo->mergeopfamilies);
+	}
+
+	foreach(lc, matched_uniq_indexes)
+	{
+		bool	multi_nullvals, useful;
+		List	*exprs = get_exprs_from_uniqueindex(lfirst_node(IndexOptInfo, lc),
+													const_exprs,
+													expr_opfamilies,
+													used_attrs,
+													&useful,
+													&multi_nullvals);
+		if (useful)
+		{
+			if (exprs == NIL)
+			{
+				/* All the columns in Unique Index matched with a restrictinfo */
+				add_uniquekey_for_onerow(baserel);
+				return;
+			}
+			baserel->uniquekeys = lappend(baserel->uniquekeys,
+										  makeUniqueKey(exprs, multi_nullvals));
+		}
+	}
+}
+
+
+/*
+ * populate_partitionedrel_uniquekeys
+ * The UniqueKeys of a partitioned rel come from 2 cases:
+ * 1). Only one partition is involved in this query; the unique keys can be
+ * copied up from the childrel to the parent rel.
+ * 2). There are unique indexes which include the partition key and exist
+ * in all the related partitions.
+ * We don't bother with rule 2 if we hit rule 1.
+ */
+
+void
+populate_partitionedrel_uniquekeys(PlannerInfo *root,
+								   RelOptInfo *rel,
+								   List *childrels)
+{
+	ListCell	*lc;
+	List	*global_uniq_indexlist = NIL;
+	RelOptInfo *childrel;
+	bool is_first = true;
+
+	Assert(IS_PARTITIONED_REL(rel));
+
+	if (childrels == NIL)
+		return;
+
+	/*
+	 * If there is only one partition used in this query, the UniqueKey of the
+	 * childrel is still valid at the parent level, but we need to convert the
+	 * exprs from child form to parent form.
+	 */
+	if (list_length(childrels) == 1)
+	{
+		/* Check for Rule 1 */
+		RelOptInfo *childrel = linitial_node(RelOptInfo, childrels);
+		ListCell	*lc;
+		Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
+		if (relation_is_onerow(childrel))
+		{
+			add_uniquekey_for_onerow(rel);
+			return;
+		}
+
+		foreach(lc, childrel->uniquekeys)
+		{
+			UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+			AppendRelInfo *appinfo = find_appinfo_by_child(root, childrel->relid);
+			List *parent_exprs = NIL;
+			bool can_reuse = true;
+			ListCell	*lc2;
+			foreach(lc2, ukey->exprs)
+			{
+				Var *var = (Var *)lfirst(lc2);
+				/*
+				 * If the expr is an expression rather than a plain Var, it is
+				 * hard to build the matching expression on the parent, so
+				 * ignore that case for now.
+				 */
+				if(!IsA(var, Var))
+				{
+					can_reuse = false;
+					break;
+				}
+				/* Convert it to parent var */
+				parent_exprs = lappend(parent_exprs, find_parent_var(appinfo, var));
+			}
+			if (can_reuse)
+				rel->uniquekeys = lappend(rel->uniquekeys,
+										  makeUniqueKey(parent_exprs,
+														ukey->multi_nullvals));
+		}
+	}
+	else
+	{
+		/* Check for rule 2 */
+		childrel = linitial_node(RelOptInfo, childrels);
+		foreach(lc, childrel->indexlist)
+		{
+			IndexOptInfo *ind = lfirst(lc);
+			IndexOptInfo *modified_index;
+			if (!ind->unique || !ind->immediate ||
+				(ind->indpred != NIL && !ind->predOK))
+				continue;
+
+			/*
+			 * During simple_copy_indexinfo_to_parent, we need to convert Vars
+			 * from child to parent; an index on an expression is too complex
+			 * to handle, so ignore it for now.
+			 */
+			if (ind->indexprs != NIL)
+				continue;
+
+			modified_index = simple_copy_indexinfo_to_parent(root, rel, ind);
+			/*
+			 * If the unique index doesn't contain partkey, then it is unique
+			 * on this partition only, so it is useless for us.
+			 */
+			if (!index_constains_partkey(rel, modified_index))
+				continue;
+
+			global_uniq_indexlist = lappend(global_uniq_indexlist,  modified_index);
+		}
+
+		if (global_uniq_indexlist != NIL)
+		{
+			foreach(lc, childrels)
+			{
+				RelOptInfo *child = lfirst(lc);
+				if (is_first)
+				{
+					is_first = false;
+					continue;
+				}
+				adjust_partition_unique_indexlist(root, rel, child, &global_uniq_indexlist);
+			}
+			/*
+			 * Now we have a list of unique indexes which are exactly the same
+			 * on all childrels.  Set the UniqueKeys just as for a
+			 * non-partitioned table.
+			 */
+			populate_baserel_uniquekeys(root, rel, global_uniq_indexlist);
+		}
+	}
+}
+
+
+/*
+ * populate_distinctrel_uniquekeys
+ */
+void
+populate_distinctrel_uniquekeys(PlannerInfo *root,
+								RelOptInfo *inputrel,
+								RelOptInfo *distinctrel)
+{
+	/* The unique key before the distinct is still valid. */
+	distinctrel->uniquekeys = list_copy(inputrel->uniquekeys);
+	add_uniquekey_from_sortgroups(root, distinctrel, root->parse->distinctClause);
+}
+
+/*
+ * populate_grouprel_uniquekeys
+ */
+void
+populate_grouprel_uniquekeys(PlannerInfo *root,
+							 RelOptInfo *grouprel,
+							 RelOptInfo *inputrel)
+
+{
+	Query *parse = root->parse;
+	bool input_ukey_added = false;
+	ListCell *lc;
+
+	if (relation_is_onerow(inputrel))
+	{
+		add_uniquekey_for_onerow(grouprel);
+		return;
+	}
+	if (parse->groupingSets)
+		return;
+
+	/* A Normal group by without grouping set. */
+	if (parse->groupClause)
+	{
+		/*
+		 * Current even the groupby clause is Unique already, but if query has aggref
+		 * We have to create grouprel still. To keep the UnqiueKey short, we will check
+		 * the UniqueKey of input_rel still valid, if so we reuse it.
+		 */
+		foreach(lc, inputrel->uniquekeys)
+		{
+			UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+			if (list_is_subset(ukey->exprs, grouprel->reltarget->exprs))
+			{
+				grouprel->uniquekeys = lappend(grouprel->uniquekeys,
+											   ukey);
+				input_ukey_added = true;
+			}
+		}
+		if (!input_ukey_added)
+			/*
+			 * The group by clause must be a superset of grouprel->reltarget->exprs
+			 * minus the aggregation exprs, so if such exprs are unique already,
+			 * don't bother generating a new uniquekey for the group by exprs.
+			 */
+			add_uniquekey_from_sortgroups(root,
+										  grouprel,
+										  root->parse->groupClause);
+	}
+	else
+		/* It has aggregation but without a group by, so only one row returned */
+		add_uniquekey_for_onerow(grouprel);
+}
+
+/*
+ * simple_copy_uniquekeys
+ * Using a function for this one-liner makes it easy to find the places where
+ * we simply copy the uniquekeys.
+ */
+void
+simple_copy_uniquekeys(RelOptInfo *oldrel,
+					   RelOptInfo *newrel)
+{
+	newrel->uniquekeys = oldrel->uniquekeys;
+}
+
+/*
+ *  populate_unionrel_uniquekeys
+ */
+void
+populate_unionrel_uniquekeys(PlannerInfo *root,
+							  RelOptInfo *unionrel)
+{
+	ListCell	*lc;
+	List	*exprs = NIL;
+
+	Assert(unionrel->uniquekeys == NIL);
+
+	foreach(lc, unionrel->reltarget->exprs)
+	{
+		exprs = lappend(exprs, lfirst(lc));
+	}
+
+	if (exprs == NIL)
+		/* SQL: select union select; is valid, we need to handle it here. */
+		add_uniquekey_for_onerow(unionrel);
+	else
+		unionrel->uniquekeys = lappend(unionrel->uniquekeys,
+									   makeUniqueKey(exprs,false));
+
+}
+
+/*
+ * populate_joinrel_uniquekeys
+ *
+ * Populate the uniquekeys for a joinrel. We check each input relation to see
+ * if its UniqueKey is still valid via innerrel_keeps_unique; if so, we add it
+ * to the joinrel.  The multi_nullvals field is changed to true for some outer
+ * join cases, and a one-row UniqueKey needs to be converted to a normal
+ * UniqueKey in those cases as well.
+ * For a uniquekey of either input rel that can't stay unique after the join,
+ * we still check whether the combination of UniqueKeys from both sides is
+ * useful for us; if yes, we add it to the joinrel as well.
+ */
+void
+populate_joinrel_uniquekeys(PlannerInfo *root, RelOptInfo *joinrel,
+							RelOptInfo *outerrel, RelOptInfo *innerrel,
+							List *restrictlist, JoinType jointype)
+{
+	ListCell *lc, *lc2;
+	List	*clause_list = NIL;
+	List	*outerrel_ukey_ctx;
+	List	*innerrel_ukey_ctx;
+	bool	inner_onerow, outer_onerow;
+	bool	mergejoin_allowed;
+
+	/* Only the outer relation matters for SEMI/ANTI joins */
+	if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+	{
+		foreach(lc, outerrel->uniquekeys)
+		{
+			UniqueKey	*uniquekey = lfirst_node(UniqueKey, lc);
+			if (list_is_subset(uniquekey->exprs, joinrel->reltarget->exprs))
+				joinrel->uniquekeys = lappend(joinrel->uniquekeys, uniquekey);
+		}
+		return;
+	}
+
+	Assert(jointype == JOIN_LEFT || jointype == JOIN_FULL || jointype == JOIN_INNER);
+
+	/* Fast path */
+	if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+		return;
+
+	inner_onerow = relation_is_onerow(innerrel);
+	outer_onerow = relation_is_onerow(outerrel);
+
+	outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(outerrel);
+	innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(innerrel);
+
+	clause_list = select_mergejoin_clauses(root, joinrel, outerrel, innerrel,
+										   restrictlist, jointype,
+										   &mergejoin_allowed);
+
+	if (innerrel_keeps_unique(root, innerrel, outerrel, clause_list, true /* reverse */))
+	{
+		bool outer_impact = jointype == JOIN_FULL;
+		foreach(lc, outerrel_ukey_ctx)
+		{
+			UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+			if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+			{
+				ctx->useful = false;
+				continue;
+			}
+
+			/*
+			 * If the outer relation has one row and its unique key is not
+			 * duplicated after the join, the joinrel will still have one row
+			 * unless jointype == JOIN_FULL.
+			 */
+			if (outer_onerow && !outer_impact)
+			{
+				add_uniquekey_for_onerow(joinrel);
+				return;
+			}
+			else if (outer_onerow)
+			{
+				/*
+				 * The onerow outerrel becomes multiple rows and multi_nullvals
+				 * is changed to true. We also need to set the exprs correctly,
+				 * since they can't be NIL any more.
+				 */
+				ListCell *lc2;
+				foreach(lc2, get_exprs_from_uniquekey(joinrel, outerrel, NULL))
+				{
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(lfirst(lc2), true));
+				}
+			}
+			else
+			{
+				if (!ctx->uniquekey->multi_nullvals && outer_impact)
+					/* Change multi_nullvals to true due to the full join. */
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(ctx->uniquekey->exprs, true));
+				else
+					/* Just reuse it */
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  ctx->uniquekey);
+			}
+			ctx->added_to_joinrel = true;
+		}
+	}
+
+	if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+	{
+		bool outer_impact = jointype == JOIN_FULL || jointype == JOIN_LEFT;
+
+		foreach(lc, innerrel_ukey_ctx)
+		{
+			UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+			if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+			{
+				ctx->useful = false;
+				continue;
+			}
+
+			if (inner_onerow &&  !outer_impact)
+			{
+				add_uniquekey_for_onerow(joinrel);
+				return;
+			}
+			else if (inner_onerow)
+			{
+				ListCell *lc2;
+				foreach(lc2, get_exprs_from_uniquekey(joinrel, innerrel, NULL))
+				{
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(lfirst(lc2), true));
+				}
+			}
+			else
+			{
+				if (!ctx->uniquekey->multi_nullvals && outer_impact)
+					/* Need to change multi_nullvals to true due to the outer join. */
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  makeUniqueKey(ctx->uniquekey->exprs,
+																true));
+				else
+					joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+												  ctx->uniquekey);
+
+			}
+			ctx->added_to_joinrel = true;
+		}
+	}
+
+	/*
+	 * The combination of the UniqueKeys from both sides is unique as well,
+	 * regardless of the join type, but don't bother adding it if a subset of
+	 * it has already been added to the joinrel or if it is not useful for the
+	 * joinrel.
+	 */
+	foreach(lc, outerrel_ukey_ctx)
+	{
+		UniqueKeyContext ctx1 = (UniqueKeyContext) lfirst(lc);
+		if (ctx1->added_to_joinrel || !ctx1->useful)
+			continue;
+		foreach(lc2, innerrel_ukey_ctx)
+		{
+			UniqueKeyContext ctx2 = (UniqueKeyContext) lfirst(lc2);
+			if (ctx2->added_to_joinrel || !ctx2->useful)
+				continue;
+			if (add_combined_uniquekey(joinrel, outerrel, innerrel,
+									   ctx1->uniquekey, ctx2->uniquekey,
+									   jointype))
+				/* If we set a onerow UniqueKey on the joinrel, we don't need any others. */
+				return;
+		}
+	}
+}
+
+
+/*
+ * convert_subquery_uniquekeys
+ *
+ * Convert the UniqueKeys of a subquery to the outer relation.
+ */
+void convert_subquery_uniquekeys(PlannerInfo *root,
+								 RelOptInfo *currel,
+								 RelOptInfo *sub_final_rel)
+{
+	ListCell	*lc;
+
+	if (sub_final_rel->uniquekeys == NIL)
+		return;
+
+	if (relation_is_onerow(sub_final_rel))
+	{
+		add_uniquekey_for_onerow(currel);
+		return;
+	}
+
+	Assert(currel->subroot != NULL);
+
+	foreach(lc, sub_final_rel->uniquekeys)
+	{
+		UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+		ListCell	*lc;
+		List	*exprs = NIL;
+		bool	ukey_useful = true;
+
+		/* One row case is handled above */
+		Assert(ukey->exprs != NIL);
+		foreach(lc, ukey->exprs)
+		{
+			Var *var;
+			TargetEntry *tle = tlist_member(lfirst(lc),
+											currel->subroot->processed_tlist);
+			if (tle == NULL)
+			{
+				ukey_useful = false;
+				break;
+			}
+			var = find_var_for_subquery_tle(currel, tle);
+			if (var == NULL)
+			{
+				ukey_useful = false;
+				break;
+			}
+			exprs = lappend(exprs, var);
+		}
+
+		if (ukey_useful)
+			currel->uniquekeys = lappend(currel->uniquekeys,
+										 makeUniqueKey(exprs,
+													   ukey->multi_nullvals));
+
+	}
+}
+
+/*
+ * innerrel_keeps_unique
+ *
+ * Check if a unique key of the innerrel is still valid after the join. The
+ * innerrel's UniqueKey stays valid if a mergeable clause of the form
+ * "innerrel.any_col = outerrel.uniquekey" exists in clause_list.
+ *
+ * Note: clause_list must already be a list of mergeable RestrictInfos.
+ */
+static bool
+innerrel_keeps_unique(PlannerInfo *root,
+					  RelOptInfo *outerrel,
+					  RelOptInfo *innerrel,
+					  List *clause_list,
+					  bool reverse)
+{
+	ListCell	*lc, *lc2, *lc3;
+
+	if (outerrel->uniquekeys == NIL || innerrel->uniquekeys == NIL)
+		return false;
+
+	/* Check if there is outerrel's uniquekey in mergeable clause. */
+	foreach(lc, outerrel->uniquekeys)
+	{
+		List	*outer_uq_exprs = lfirst_node(UniqueKey, lc)->exprs;
+		bool clauselist_matchs_all_exprs = true;
+		foreach(lc2, outer_uq_exprs)
+		{
+			Node *outer_uq_expr = lfirst(lc2);
+			bool find_uq_expr_in_clauselist = false;
+			foreach(lc3, clause_list)
+			{
+				RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc3);
+				Node *outer_expr;
+				if (reverse)
+					outer_expr = rinfo->outer_is_left ? get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+				else
+					outer_expr = rinfo->outer_is_left ? get_leftop(rinfo->clause) : get_rightop(rinfo->clause);
+				if (equal(outer_expr, outer_uq_expr))
+				{
+					find_uq_expr_in_clauselist = true;
+					break;
+				}
+			}
+			if (!find_uq_expr_in_clauselist)
+			{
+				/* No need to check the next exprs in the current uniquekey */
+				clauselist_matchs_all_exprs = false;
+				break;
+			}
+		}
+
+		if (clauselist_matchs_all_exprs)
+			return true;
+	}
+	return false;
+}
+
+
+/*
+ * relation_is_onerow
+ * Check if it is a one-row relation by checking UniqueKey.
+ */
+bool
+relation_is_onerow(RelOptInfo *rel)
+{
+	UniqueKey *ukey;
+	if (rel->uniquekeys == NIL)
+		return false;
+	ukey = linitial_node(UniqueKey, rel->uniquekeys);
+	return ukey->exprs == NIL && list_length(rel->uniquekeys) == 1;
+}
+
+/*
+ * relation_has_uniquekeys_for
+ *		Returns true if we have proofs that 'rel' cannot return multiple rows with
+ *		the same values in each of 'exprs'.  Otherwise returns false.
+ */
+bool
+relation_has_uniquekeys_for(PlannerInfo *root, RelOptInfo *rel,
+							List *exprs, bool allow_multinulls)
+{
+	ListCell *lc;
+
+	/*
+	 * In the UniqueKey onerow case, uniquekey->exprs is empty as well, so we
+	 * can't rely on list_is_subset to handle this special case.
+	 */
+	if (exprs == NIL)
+		return false;
+
+	foreach(lc, rel->uniquekeys)
+	{
+		UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+		if (ukey->multi_nullvals && !allow_multinulls)
+			continue;
+		if (list_is_subset(ukey->exprs, exprs))
+			return true;
+	}
+	return false;
+}
+
+
+/*
+ * get_exprs_from_uniqueindex
+ *
+ * Return a list of exprs which are unique. Set *useful to false if this
+ * unique index is not useful for us.
+ */
+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+						   List *const_exprs,
+						   List *const_expr_opfamilies,
+						   Bitmapset *used_varattrs,
+						   bool *useful,
+						   bool *multi_nullvals)
+{
+	List	*exprs = NIL;
+	ListCell	*indexpr_item;
+	int	c = 0;
+
+	*useful = true;
+	*multi_nullvals = false;
+
+	indexpr_item = list_head(unique_index->indexprs);
+	for(c = 0; c < unique_index->ncolumns; c++)
+	{
+		int attr = unique_index->indexkeys[c];
+		Expr *expr;
+		bool	matched_const = false;
+		ListCell	*lc1, *lc2;
+
+		if(attr > 0)
+		{
+			expr = list_nth_node(TargetEntry, unique_index->indextlist, c)->expr;
+		}
+		else if (attr == 0)
+		{
+			/* Expression index */
+			expr = lfirst(indexpr_item);
+			indexpr_item = lnext(unique_index->indexprs, indexpr_item);
+		}
+		else /* attr < 0 */
+		{
+			/* Index on system column is not supported */
+			Assert(false);
+		}
+
+		/*
+		 * Check the index_col = Const case with regard to the opfamily, to see
+		 * if we can remove the index_col from the final UniqueKey->exprs.
+		 */
+		forboth(lc1, const_exprs, lc2, const_expr_opfamilies)
+		{
+			if (list_member_oid((List *)lfirst(lc2), unique_index->opfamily[c])
+				&& match_index_to_operand((Node *) lfirst(lc1), c, unique_index))
+			{
+				matched_const = true;
+				break;
+			}
+		}
+
+		if (matched_const)
+			continue;
+
+		/* Check if the indexed expr is used in rel */
+		if (attr > 0)
+		{
+			/*
+			 * Normal indexed column: if the column is not used, then the index
+			 * is useless for a uniquekey.
+			 */
+			attr -= FirstLowInvalidHeapAttributeNumber;
+
+			if (!bms_is_member(attr, used_varattrs))
+			{
+				*useful = false;
+				break;
+			}
+		}
+		else if (!list_member(unique_index->rel->reltarget->exprs, expr))
+		{
+			/* Expression index but the expression is not used in rel */
+			*useful = false;
+			break;
+		}
+
+		/* check not null property. */
+		if (attr == 0)
+		{
+			/* We never know whether an expression yields NULL or not */
+			*multi_nullvals = true;
+		}
+		else if (!bms_is_member(attr, unique_index->rel->notnullattrs)
+				 && !bms_is_member(0 - FirstLowInvalidHeapAttributeNumber,
+								   unique_index->rel->notnullattrs))
+		{
+			*multi_nullvals = true;
+		}
+
+		exprs = lappend(exprs, expr);
+	}
+	return exprs;
+}
+
+
+/*
+ * add_uniquekey_for_onerow
+ * If we are sure that the relation returns only one row, then all the columns
+ * are unique. However, we don't need to create a UniqueKey for every column;
+ * we just set exprs = NIL and overwrite all the other UniqueKeys on this
+ * RelOptInfo, since this one has the strongest semantics.
+ */
+static void
+add_uniquekey_for_onerow(RelOptInfo *rel)
+{
+	/*
+	 * We overwrite the previous UniqueKey on purpose since this one has the
+	 * strongest semantic.
+	 */
+	rel->uniquekeys = list_make1(makeUniqueKey(NIL, false));
+}
+
+
+/*
+ * initililze_uniquecontext_for_joinrel
+ * Return a List of UniqueKeyContext for an inputrel
+ */
+static List *
+initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel)
+{
+	List	*res = NIL;
+	ListCell *lc;
+	foreach(lc,  inputrel->uniquekeys)
+	{
+		UniqueKeyContext context;
+		context = palloc(sizeof(struct UniqueKeyContextData));
+		context->uniquekey = lfirst_node(UniqueKey, lc);
+		context->added_to_joinrel = false;
+		context->useful = true;
+		res = lappend(res, context);
+	}
+	return res;
+}
+
+
+/*
+ * get_exprs_from_uniquekey
+ *	Unify the way of getting the exprs from a one-row UniqueKey or a
+ * normal UniqueKey. For the onerow case, every expr in rel1 is a valid
+ * UniqueKey. Returns a List of expr lists.
+ *
+ * rel1: the relation whose exprs you want to get.
+ * ukey: the UniqueKey whose exprs you want to get.
+ */
+static List *
+get_exprs_from_uniquekey(RelOptInfo *joinrel, RelOptInfo *rel1, UniqueKey *ukey)
+{
+	ListCell *lc;
+	bool onerow = rel1 != NULL && relation_is_onerow(rel1);
+
+	List	*res = NIL;
+	Assert(onerow || ukey);
+	if (onerow)
+	{
+		/* Only care about the exprs that still exist in the joinrel */
+		foreach(lc, joinrel->reltarget->exprs)
+		{
+			Bitmapset *relids = pull_varnos(lfirst(lc));
+			if (bms_is_subset(relids, rel1->relids))
+			{
+				res = lappend(res, list_make1(lfirst(lc)));
+			}
+		}
+	}
+	else
+	{
+		res = list_make1(ukey->exprs);
+	}
+	return res;
+}
+
+/*
+ * Partitioned table Unique Keys.
+ * A partitioned table's unique key is maintained as follows:
+ * 1. The index must be unique, as usual.
+ * 2. The index must contain the partition key.
+ * 3. The index must exist on all the child rels; see simple_indexinfo_equal
+ *    for how we compare them.
+ */
+
+/*
+ * index_constains_partkey
+ * Return true if the index contains the partition key.
+ */
+static bool
+index_constains_partkey(RelOptInfo *partrel,  IndexOptInfo *ind)
+{
+	ListCell	*lc;
+	int	i;
+	Assert(IS_PARTITIONED_REL(partrel));
+	Assert(partrel->part_scheme->partnatts > 0);
+
+	for(i = 0; i < partrel->part_scheme->partnatts; i++)
+	{
+		Node *part_expr = linitial(partrel->partexprs[i]);
+		bool found_in_index = false;
+		foreach(lc, ind->indextlist)
+		{
+			Expr *index_expr = lfirst_node(TargetEntry, lc)->expr;
+			if (equal(index_expr, part_expr))
+			{
+				found_in_index = true;
+				break;
+			}
+		}
+		if (!found_in_index)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * simple_indexinfo_equal
+ *
+ * Used to check if the 2 indexes are the same. The indexes here were COPIED
+ * from the childrels with some tiny changes (see
+ * simple_copy_indexinfo_to_parent).
+ */
+static bool
+simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2)
+{
+	Size oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+	return ind1->ncolumns == ind2->ncolumns &&
+		ind1->unique == ind2->unique &&
+		memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+		memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+		memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+		memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+		equal(get_tlist_exprs(ind1->indextlist, true),
+			  get_tlist_exprs(ind2->indextlist, true));
+}
+
+
+/*
+ * The macros below are used for simple_copy_indexinfo_to_parent, which is so
+ * customized that it doesn't belong in copyfuncs.c, so they are copied here.
+ */
+#define COPY_POINTER_FIELD(fldname, sz) \
+	do { \
+		Size	_size = (sz); \
+		newnode->fldname = palloc(_size); \
+		memcpy(newnode->fldname, from->fldname, _size); \
+	} while (0)
+
+#define COPY_NODE_FIELD(fldname) \
+	(newnode->fldname = copyObjectImpl(from->fldname))
+
+#define COPY_SCALAR_FIELD(fldname) \
+	(newnode->fldname = from->fldname)
+
+
+/*
+ * simple_copy_indexinfo_to_parent (from partition)
+ * Copy the IndexOptInfo from a child relation to the parent relation with
+ * some modifications. It is used to test:
+ * 1. Whether the same index exists in all the childrels.
+ * 2. Whether the parentrel's reltarget/baserestrictinfo matches this index.
+ */
+static IndexOptInfo *
+simple_copy_indexinfo_to_parent(PlannerInfo *root,
+								RelOptInfo *parentrel,
+								IndexOptInfo *from)
+{
+	IndexOptInfo *newnode = makeNode(IndexOptInfo);
+	AppendRelInfo *appinfo = find_appinfo_by_child(root, from->rel->relid);
+	ListCell	*lc;
+	int	idx = 0;
+
+	COPY_SCALAR_FIELD(ncolumns);
+	COPY_SCALAR_FIELD(nkeycolumns);
+	COPY_SCALAR_FIELD(unique);
+	COPY_SCALAR_FIELD(immediate);
+	/* We just need to know if it is NIL or not */
+	COPY_SCALAR_FIELD(indpred);
+	COPY_SCALAR_FIELD(predOK);
+	COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+	COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+	COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+	COPY_NODE_FIELD(indextlist);
+
+	/* Convert index exprs on child expr to expr on parent */
+	foreach(lc, newnode->indextlist)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		/* Index on expression is ignored */
+		Assert(IsA(tle->expr, Var));
+		tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+		newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+		idx++;
+	}
+	newnode->rel = parentrel;
+	return newnode;
+}
+
+/*
+ * adjust_partition_unique_indexlist
+ *
+ * global_unique_indexes: at the beginning it contains the copied & modified
+ * unique indexes from the first partition. We then check whether each index
+ * in it still exists in the following partitions; if not, we remove it. At
+ * the end it holds the indexes which exist in all the partitions.
+ */
+static void
+adjust_partition_unique_indexlist(PlannerInfo *root,
+								  RelOptInfo *parentrel,
+								  RelOptInfo *childrel,
+								  List **global_unique_indexes)
+{
+	ListCell	*lc, *lc2;
+	foreach(lc, *global_unique_indexes)
+	{
+		IndexOptInfo	*g_ind = lfirst_node(IndexOptInfo, lc);
+		bool found_in_child = false;
+
+		foreach(lc2, childrel->indexlist)
+		{
+			IndexOptInfo   *p_ind = lfirst_node(IndexOptInfo, lc2);
+			IndexOptInfo   *p_ind_copy;
+			if (!p_ind->unique || !p_ind->immediate ||
+				(p_ind->indpred != NIL && !p_ind->predOK))
+				continue;
+			p_ind_copy = simple_copy_indexinfo_to_parent(root, parentrel, p_ind);
+			if (simple_indexinfo_equal(p_ind_copy, g_ind))
+			{
+				found_in_child = true;
+				break;
+			}
+		}
+		if (!found_in_child)
+			/* The index doesn't exist in childrel, remove it from global_unique_indexes */
+			*global_unique_indexes = foreach_delete_current(*global_unique_indexes, lc);
+	}
+}
+
+/* Helper function for groupres/distinctrel */
+static void
+add_uniquekey_from_sortgroups(PlannerInfo *root, RelOptInfo *rel, List *sortgroups)
+{
+	Query *parse = root->parse;
+	List	*exprs;
+
+	/*
+	 * XXX: If there are vars which are not at the current query level, the
+	 * semantics are imprecise; should we avoid this or not? levelsup = 1 is
+	 * just a demo; maybe we need to check every level other than 0, and if
+	 * so, it looks like we have to write another pull_var walker.
+	 */
+	List	*upper_vars = pull_vars_of_level((Node*)sortgroups, 1);
+
+	if (upper_vars != NIL)
+		return;
+
+	exprs = get_sortgrouplist_exprs(sortgroups, parse->targetList);
+	rel->uniquekeys = lappend(rel->uniquekeys,
+							  makeUniqueKey(exprs,
+											false /* sortgroupclause can't be multi_nullvals */));
+}
+
+
+/*
+ * add_combined_uniquekey
+ * The combination of both UniqueKeys is a valid UniqueKey for the joinrel
+ * regardless of the jointype.
+ */
+static bool
+add_combined_uniquekey(RelOptInfo *joinrel,
+					   RelOptInfo *outer_rel,
+					   RelOptInfo *inner_rel,
+					   UniqueKey *outer_ukey,
+					   UniqueKey *inner_ukey,
+					   JoinType jointype)
+{
+
+	ListCell	*lc1, *lc2;
+
+	/*
+	 * If either side has multi_nullvals or we have an outer join, the
+	 * combined UniqueKey has multi_nullvals.
+	 */
+	bool multi_nullvals = outer_ukey->multi_nullvals ||
+		inner_ukey->multi_nullvals || IS_OUTER_JOIN(jointype);
+
+	/* The only case where we can get a onerow joinrel after the join */
+	if (relation_is_onerow(outer_rel)
+		 && relation_is_onerow(inner_rel)
+		 && jointype == JOIN_INNER)
+	{
+		add_uniquekey_for_onerow(joinrel);
+		return true;
+	}
+
+	foreach(lc1, get_exprs_from_uniquekey(joinrel, outer_rel, outer_ukey))
+	{
+		foreach(lc2, get_exprs_from_uniquekey(joinrel, inner_rel, inner_ukey))
+		{
+			List *exprs = list_concat_copy(lfirst_node(List, lc1), lfirst_node(List, lc2));
+			joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+										  makeUniqueKey(exprs,
+														multi_nullvals));
+		}
+	}
+	return false;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b406d41e91..0551ae0512 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -2389,6 +2389,8 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 		add_path(final_rel, path);
 	}
 
+	simple_copy_uniquekeys(current_rel, final_rel);
+
 	/*
 	 * Generate partial paths for final_rel, too, if outer query levels might
 	 * be able to make use of them.
@@ -3899,6 +3901,8 @@ create_grouping_paths(PlannerInfo *root,
 	}
 
 	set_cheapest(grouped_rel);
+
+	populate_grouprel_uniquekeys(root, grouped_rel, input_rel);
 	return grouped_rel;
 }
 
@@ -4615,7 +4619,7 @@ create_window_paths(PlannerInfo *root,
 
 	/* Now choose the best path(s) */
 	set_cheapest(window_rel);
-
+	simple_copy_uniquekeys(input_rel, window_rel);
 	return window_rel;
 }
 
@@ -4911,7 +4915,7 @@ create_distinct_paths(PlannerInfo *root,
 
 	/* Now choose the best path(s) */
 	set_cheapest(distinct_rel);
-
+	populate_distinctrel_uniquekeys(root, input_rel, distinct_rel);
 	return distinct_rel;
 }
 
@@ -5172,6 +5176,8 @@ create_ordered_paths(PlannerInfo *root,
 	 */
 	Assert(ordered_rel->pathlist != NIL);
 
+	simple_copy_uniquekeys(input_rel, ordered_rel);
+
 	return ordered_rel;
 }
 
@@ -6049,6 +6055,9 @@ adjust_paths_for_srfs(PlannerInfo *root, RelOptInfo *rel,
 	if (list_length(targets) == 1)
 		return;
 
+	/* UniqueKey is not valid after handling the SRF. */
+	rel->uniquekeys = NIL;
+
 	/*
 	 * Stack SRF-evaluation nodes atop each path for the rel.
 	 *
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index 951aed80e7..e94e92937c 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -689,6 +689,8 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
 	/* Undo effects of possibly forcing tuple_fraction to 0 */
 	root->tuple_fraction = save_fraction;
 
+	/* Add the UniqueKeys */
+	populate_unionrel_uniquekeys(root, result_rel);
 	return result_rel;
 }
 
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index d722063cf3..44c37ecffc 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -746,3 +746,47 @@ find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
 	}
 	return appinfos;
 }
+
+/*
+ * find_appinfo_by_child
+ * Find the AppendRelInfo whose child_relid is 'child_index'.
+ */
+AppendRelInfo *
+find_appinfo_by_child(PlannerInfo *root, Index child_index)
+{
+	ListCell	*lc;
+	foreach(lc, root->append_rel_list)
+	{
+		AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+		if (appinfo->child_relid == child_index)
+			return appinfo;
+	}
+	elog(ERROR, "parent relation cant be found");
+	return NULL;
+}
+
+/*
+ * find_parent_var
+ * Translate a child Var into the corresponding Var on the parent relation,
+ * using appinfo->translated_vars.
+ */
+Var *
+find_parent_var(AppendRelInfo *appinfo, Var *child_var)
+{
+	ListCell	*lc;
+	Var	*res = NULL;
+	Index attno = 1;
+	foreach(lc, appinfo->translated_vars)
+	{
+		Node *child_node = lfirst(lc);
+		if (equal(child_node, child_var))
+		{
+			res = copyObject(child_var);
+			res->varattno = attno;
+			res->varno = appinfo->parent_relid;
+		}
+		attno++;
+	}
+	if (res == NULL)
+		elog(ERROR, "parent var can't be found.");
+	return res;
+}
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 3132fd35a5..d66b40ec50 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -736,6 +736,7 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
 		{
 			Node	   *onecq = (Node *) lfirst(lc2);
 			bool		pseudoconstant;
+			RestrictInfo	*child_rinfo;
 
 			/* check for pseudoconstant (no Vars or volatile functions) */
 			pseudoconstant =
@@ -747,13 +748,14 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
 				root->hasPseudoConstantQuals = true;
 			}
 			/* reconstitute RestrictInfo with appropriate properties */
-			childquals = lappend(childquals,
-								 make_restrictinfo((Expr *) onecq,
-												   rinfo->is_pushed_down,
-												   rinfo->outerjoin_delayed,
-												   pseudoconstant,
-												   rinfo->security_level,
-												   NULL, NULL, NULL));
+			child_rinfo =  make_restrictinfo((Expr *) onecq,
+											 rinfo->is_pushed_down,
+											 rinfo->outerjoin_delayed,
+											 pseudoconstant,
+											 rinfo->security_level,
+											 NULL, NULL, NULL);
+			child_rinfo->mergeopfamilies = rinfo->mergeopfamilies;
+			childquals = lappend(childquals, child_rinfo);
 			/* track minimum security level among child quals */
 			cq_min_security = Min(cq_min_security, rinfo->security_level);
 		}
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index 31d9aedeeb..c83f17acb7 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -16,6 +16,7 @@
 
 #include "nodes/execnodes.h"
 #include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
 
 
 extern A_Expr *makeA_Expr(A_Expr_Kind kind, List *name,
@@ -105,4 +106,6 @@ extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int loc
 
 extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
 
+extern UniqueKey* makeUniqueKey(List *exprs, bool multi_nullvals);
+
 #endif							/* MAKEFUNC_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 381d84b4e4..41110ed888 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -264,6 +264,7 @@ typedef enum NodeTag
 	T_EquivalenceMember,
 	T_PathKey,
 	T_PathTarget,
+	T_UniqueKey,
 	T_RestrictInfo,
 	T_IndexClause,
 	T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 9e3ebd488a..02e4458bef 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -730,6 +730,7 @@ typedef struct RelOptInfo
 	QualCost	baserestrictcost;	/* cost of evaluating the above */
 	Index		baserestrict_min_security;	/* min security_level found in
 											 * baserestrictinfo */
+	List	   *uniquekeys;		/* List of UniqueKey */
 	List	   *joininfo;		/* RestrictInfo structures for join clauses
 								 * involving this rel */
 	bool		has_eclass_joins;	/* T means joininfo is incomplete */
@@ -1047,6 +1048,28 @@ typedef struct PathKey
 } PathKey;
 
 
+/*
+ * UniqueKey
+ *
+ * Represents the unique properties held by a RelOptInfo.
+ *
+ * exprs is a list of exprs which is unique on current RelOptInfo. exprs = NIL
+ * is a special case of UniqueKey, which means there is only 1 row in that
+ * relation.
+ * multi_nullvals: true means multi null values may exist in these exprs, so the
+ * uniqueness is not guaranteed in this case. This field is necessary for
+ * remove_useless_join & reduce_unique_semijoins where we don't mind these
+ * duplicated NULL values. It is set to true for 2 cases. One is a unique key
+ * from a unique index but the related column is nullable. The other one is for
+ * outer join. see populate_joinrel_uniquekeys for detail.
+ */
+typedef struct UniqueKey
+{
+	NodeTag		type;
+	List	   *exprs;
+	bool		multi_nullvals;
+} UniqueKey;
+
 /*
  * PathTarget
  *
@@ -2473,7 +2496,7 @@ typedef enum
  *
  * flags indicating what kinds of grouping are possible.
  * partial_costs_set is true if the agg_partial_costs and agg_final_costs
- * 		have been initialized.
+ *		have been initialized.
  * agg_partial_costs gives partial aggregation costs.
  * agg_final_costs gives finalization costs.
  * target_parallel_safe is true if target is parallel safe.
@@ -2503,8 +2526,8 @@ typedef struct
  * limit_tuples is an estimated bound on the number of output tuples,
  *		or -1 if no LIMIT or couldn't estimate.
  * count_est and offset_est are the estimated values of the LIMIT and OFFSET
- * 		expressions computed by preprocess_limit() (see comments for
- * 		preprocess_limit() for more information).
+ *		expressions computed by preprocess_limit() (see comments for
+ *		preprocess_limit() for more information).
  */
 typedef struct
 {
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 14ea2766ad..621f54a9f8 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -528,6 +528,8 @@ extern bool list_member_ptr(const List *list, const void *datum);
 extern bool list_member_int(const List *list, int datum);
 extern bool list_member_oid(const List *list, Oid datum);
 
+extern bool list_is_subset(const List *members, const List *target);
+
 extern List *list_delete(List *list, void *datum);
 extern List *list_delete_ptr(List *list, void *datum);
 extern List *list_delete_int(List *list, int datum);
diff --git a/src/include/optimizer/appendinfo.h b/src/include/optimizer/appendinfo.h
index d6a27a60dd..e87c92a054 100644
--- a/src/include/optimizer/appendinfo.h
+++ b/src/include/optimizer/appendinfo.h
@@ -32,4 +32,7 @@ extern Relids adjust_child_relids_multilevel(PlannerInfo *root, Relids relids,
 extern AppendRelInfo **find_appinfos_by_relids(PlannerInfo *root,
 											   Relids relids, int *nappinfos);
 
+extern AppendRelInfo *find_appinfo_by_child(PlannerInfo *root, Index child_index);
+extern Var *find_parent_var(AppendRelInfo *appinfo, Var *child_var);
+
 #endif							/* APPENDINFO_H */
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 3e4171056e..9445141263 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -23,6 +23,7 @@
 #define OPTIMIZER_H
 
 #include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
 
 /*
  * We don't want to include nodes/pathnodes.h here, because non-planner
@@ -156,6 +157,7 @@ extern TargetEntry *get_sortgroupref_tle(Index sortref,
 										 List *targetList);
 extern TargetEntry *get_sortgroupclause_tle(SortGroupClause *sgClause,
 											List *targetList);
+extern Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
 extern Node *get_sortgroupclause_expr(SortGroupClause *sgClause,
 									  List *targetList);
 extern List *get_sortgrouplist_exprs(List *sgClauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 10b6e81079..9217a8d6c6 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -240,5 +240,48 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
 									   int strategy, bool nulls_first);
 extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 									List *live_childrels);
+extern List *select_mergejoin_clauses(PlannerInfo *root,
+									  RelOptInfo *joinrel,
+									  RelOptInfo *outerrel,
+									  RelOptInfo *innerrel,
+									  List *restrictlist,
+									  JoinType jointype,
+									  bool *mergejoin_allowed);
+
+/*
+ * uniquekeys.c
+ *	  Utilities for matching and building unique keys
+ */
+extern void populate_baserel_uniquekeys(PlannerInfo *root,
+										RelOptInfo *baserel,
+										List* unique_index_list);
+extern void populate_partitionedrel_uniquekeys(PlannerInfo *root,
+												RelOptInfo *rel,
+												List *childrels);
+extern void populate_distinctrel_uniquekeys(PlannerInfo *root,
+											RelOptInfo *inputrel,
+											RelOptInfo *distinctrel);
+extern void populate_grouprel_uniquekeys(PlannerInfo *root,
+										 RelOptInfo *grouprel,
+										 RelOptInfo *inputrel);
+extern void populate_unionrel_uniquekeys(PlannerInfo *root,
+										  RelOptInfo *unionrel);
+extern void simple_copy_uniquekeys(RelOptInfo *oldrel,
+								   RelOptInfo *newrel);
+extern void convert_subquery_uniquekeys(PlannerInfo *root,
+										RelOptInfo *currel,
+										RelOptInfo *sub_final_rel);
+extern void populate_joinrel_uniquekeys(PlannerInfo *root,
+										RelOptInfo *joinrel,
+										RelOptInfo *rel1,
+										RelOptInfo *rel2,
+										List *restrictlist,
+										JoinType jointype);
+
+extern bool relation_has_uniquekeys_for(PlannerInfo *root,
+										RelOptInfo *rel,
+										List *exprs,
+										bool allow_multinulls);
+extern bool relation_is_onerow(RelOptInfo *rel);
 
 #endif							/* PATHS_H */
-- 
2.27.0

v02-0001-Extend-UniqueKeys.patch (application/octet-stream)
From bf2b2b1f22130e7accb11fd47dea64e6808b7425 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 8 Jun 2020 20:33:56 +0200
Subject: [PATCH 3/5] Extend UniqueKeys

Prepares the index skip scan implementation using UniqueKeys. Allows
specifying which "requested" keys should be unique, and adds them to the
necessary Paths to make them useful later.

Proposed by David Rowley; contains a few bits from a previous version by
Jesper Pedersen.
---
 src/backend/optimizer/path/pathkeys.c   | 59 +++++++++++++++++++++++
 src/backend/optimizer/path/uniquekeys.c | 63 +++++++++++++++++++++++++
 src/backend/optimizer/plan/planner.c    | 36 +++++++++++++-
 src/backend/optimizer/util/pathnode.c   | 32 +++++++++----
 src/include/nodes/pathnodes.h           |  5 ++
 src/include/optimizer/pathnode.h        |  1 +
 src/include/optimizer/paths.h           |  8 ++++
 7 files changed, 194 insertions(+), 10 deletions(-)

diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 7e596d4194..a4fc4f252d 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
 #include "utils/lsyscache.h"
 
 
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
 static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
 static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
 											 RelOptInfo *partrel,
@@ -95,6 +96,29 @@ make_canonical_pathkey(PlannerInfo *root,
 	return pk;
 }
 
+/*
+ * pathkey_is_unique
+ *	   Returns true if the new pathkey's equivalence class differs from that
+ *	   of every existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+	EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+	ListCell   *lc;
+
+	/* If the same EC is already in the list, then it's not unique */
+	foreach(lc, pathkeys)
+	{
+		PathKey    *old_pathkey = (PathKey *) lfirst(lc);
+
+		if (new_ec == old_pathkey->pk_eclass)
+			return false;
+	}
+
+	return true;
+}
+
 /*
  * pathkey_is_redundant
  *	   Is a pathkey redundant with one already in the given list?
@@ -1151,6 +1175,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
 	return pathkeys;
 }
 
+/*
+ * make_pathkeys_for_uniquekeys
+ *		Generate a pathkeys list to be used for uniquekeys
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+							 List *sortclauses,
+							 List *tlist)
+{
+	List	   *pathkeys = NIL;
+	ListCell   *l;
+
+	foreach(l, sortclauses)
+	{
+		SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+		Expr	   *sortkey;
+		PathKey    *pathkey;
+
+		sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+		Assert(OidIsValid(sortcl->sortop));
+		pathkey = make_pathkey_from_sortop(root,
+										   sortkey,
+										   root->nullable_baserels,
+										   sortcl->sortop,
+										   sortcl->nulls_first,
+										   sortcl->tleSortGroupRef,
+										   true);
+
+		if (pathkey_is_unique(pathkey, pathkeys))
+			pathkeys = lappend(pathkeys, pathkey);
+	}
+
+	return pathkeys;
+}
+
 /****************************************************************************
  *		PATHKEYS AND MERGECLAUSES
  ****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
index b33bcd2f32..4bc16ea023 100644
--- a/src/backend/optimizer/path/uniquekeys.c
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -1129,3 +1129,66 @@ add_combined_uniquekey(RelOptInfo *joinrel,
 	}
 	return false;
 }
+
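+/*
+ * build_uniquekeys
+ *		Build a single-element list of UniqueKeys representing the given
+ *		sort/group clauses; used for the query's requested uniquekeys.
+ */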
+List *
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+	List *result = NIL;
+	List *sortkeys;
+	ListCell *l;
+	List *exprs = NIL;
+
+	sortkeys = make_pathkeys_for_uniquekeys(root,
+											sortclauses,
+											root->processed_tlist);
+
+	/* Create a uniquekey and add it to the list */
+	foreach(l, sortkeys)
+	{
+		PathKey    *pathkey = (PathKey *) lfirst(l);
+		EquivalenceClass *ec = pathkey->pk_eclass;
+		EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
+		if (EC_MUST_BE_REDUNDANT(ec))
+			continue;
+		exprs = lappend(exprs, mem->em_expr);
+	}
+
+	result = lappend(result, makeUniqueKey(exprs, false));
+
+	return result;
+}
+
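+/*
+ * query_has_uniquekeys_for
+ *		Check whether the given path uniquekeys satisfy all of the query's
+ *		requested uniquekeys (root->query_uniquekeys).
+ */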
+bool
+query_has_uniquekeys_for(PlannerInfo *root, List *pathuniquekeys,
+						 bool allow_multinulls)
+{
+	ListCell *lc;
+	ListCell *lc2;
+
+	/* root->query_uniquekeys are the requested DISTINCT clauses on query level
+	 * pathuniquekeys are the unique keys on current path.
+	 * All requested query_uniquekeys must be satisfied by the pathuniquekeys
+	 */
+	foreach(lc, root->query_uniquekeys)
+	{
+		UniqueKey *query_ukey = lfirst_node(UniqueKey, lc);
+		bool satisfied = false;
+		foreach(lc2, pathuniquekeys)
+		{
+			UniqueKey *ukey = lfirst_node(UniqueKey, lc2);
+			if (ukey->multi_nullvals && !allow_multinulls)
+				continue;
+			if (list_length(ukey->exprs) == 0 &&
+				list_length(query_ukey->exprs) != 0)
+				continue;
+			if (list_is_subset(ukey->exprs, query_ukey->exprs))
+			{
+				satisfied = true;
+				break;
+			}
+		}
+		if (!satisfied)
+			return false;
+	}
+	return true;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0551ae0512..58386d5040 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3606,12 +3606,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 	 */
 	if (qp_extra->groupClause &&
 		grouping_is_sortable(qp_extra->groupClause))
+	{
 		root->group_pathkeys =
 			make_pathkeys_for_sortclauses(root,
 										  qp_extra->groupClause,
 										  tlist);
+		root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+	}
 	else
+	{
 		root->group_pathkeys = NIL;
+		root->query_uniquekeys = NIL;
+	}
 
 	/* We consider only the first (bottom) window in pathkeys logic */
 	if (activeWindows != NIL)
@@ -4815,13 +4821,19 @@ create_distinct_paths(PlannerInfo *root,
 			Path	   *path = (Path *) lfirst(lc);
 
 			if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
-			{
 				add_path(distinct_rel, (Path *)
 						 create_upper_unique_path(root, distinct_rel,
 												  path,
 												  list_length(root->distinct_pathkeys),
 												  numDistinctRows));
-			}
+		}
+
+		foreach(lc, input_rel->unique_pathlist)
+		{
+			Path	   *path = (Path *) lfirst(lc);
+
+			if (query_has_uniquekeys_for(root, path->uniquekeys, false))
+				add_path(distinct_rel, path);
 		}
 
 		/* For explicit-sort case, always use the more rigorous clause */
@@ -7517,6 +7529,26 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 		}
 	}
 
+	foreach(lc, rel->unique_pathlist)
+	{
+		Path	   *subpath = (Path *) lfirst(lc);
+
+		/* Shouldn't have any parameterized paths anymore */
+		Assert(subpath->param_info == NULL);
+
+		if (tlist_same_exprs)
+			subpath->pathtarget->sortgrouprefs =
+				scanjoin_target->sortgrouprefs;
+		else
+		{
+			Path	   *newpath;
+
+			newpath = (Path *) create_projection_path(root, rel, subpath,
+													  scanjoin_target);
+			lfirst(lc) = newpath;
+		}
+	}
+
 	/*
 	 * Now, if final scan/join target contains SRFs, insert ProjectSetPath(s)
 	 * atop each existing path.  (Note that this function doesn't look at the
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5110a6b806..d4abb3cb47 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -416,10 +416,10 @@ set_cheapest(RelOptInfo *parent_rel)
  * 'parent_rel' is the relation entry to which the path corresponds.
  * 'new_path' is a potential path for parent_rel.
  *
- * Returns nothing, but modifies parent_rel->pathlist.
+ * Returns the modified pathlist.
  */
-void
-add_path(RelOptInfo *parent_rel, Path *new_path)
+static List *
+add_path_to(RelOptInfo *parent_rel, List *pathlist, Path *new_path)
 {
 	bool		accept_new = true;	/* unless we find a superior old path */
 	int			insert_at = 0;	/* where to insert new item */
@@ -440,7 +440,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	 * for more than one old path to be tossed out because new_path dominates
 	 * it.
 	 */
-	foreach(p1, parent_rel->pathlist)
+	foreach(p1, pathlist)
 	{
 		Path	   *old_path = (Path *) lfirst(p1);
 		bool		remove_old = false; /* unless new proves superior */
@@ -584,8 +584,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 		 */
 		if (remove_old)
 		{
-			parent_rel->pathlist = foreach_delete_current(parent_rel->pathlist,
-														  p1);
+			pathlist = foreach_delete_current(pathlist, p1);
 
 			/*
 			 * Delete the data pointed-to by the deleted cell, if possible
@@ -612,8 +611,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 	if (accept_new)
 	{
 		/* Accept the new path: insert it at proper place in pathlist */
-		parent_rel->pathlist =
-			list_insert_nth(parent_rel->pathlist, insert_at, new_path);
+		pathlist = list_insert_nth(pathlist, insert_at, new_path);
 	}
 	else
 	{
@@ -621,6 +619,23 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
 		if (!IsA(new_path, IndexPath))
 			pfree(new_path);
 	}
+
+	return pathlist;
+}
+
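+/*
+ * add_path
+ *	  Add new_path to parent_rel->pathlist; see add_path_to() above.
+ */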
+void
+add_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	parent_rel->pathlist = add_path_to(parent_rel,
+									   parent_rel->pathlist, new_path);
+}
+
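+/*
+ * add_unique_path
+ *	  Like add_path(), but the path is added to parent_rel->unique_pathlist.
+ */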
+void
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)
+{
+	parent_rel->unique_pathlist = add_path_to(parent_rel,
+											  parent_rel->unique_pathlist,
+											  new_path);
 }
 
 /*
@@ -2571,6 +2586,7 @@ create_projection_path(PlannerInfo *root,
 	pathnode->path.pathkeys = subpath->pathkeys;
 
 	pathnode->subpath = subpath;
+	pathnode->path.uniquekeys = subpath->uniquekeys;
 
 	/*
 	 * We might not need a separate Result node.  If the input plan node type
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 02e4458bef..a5c406bd4e 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -297,6 +297,7 @@ struct PlannerInfo
 
 	List	   *query_pathkeys; /* desired pathkeys for query_planner() */
 
+	List	   *query_uniquekeys; /* unique keys required for the query */
 	List	   *group_pathkeys; /* groupClause pathkeys, if any */
 	List	   *window_pathkeys;	/* pathkeys of bottom window, if any */
 	List	   *distinct_pathkeys;	/* distinctClause pathkeys, if any */
@@ -679,6 +680,7 @@ typedef struct RelOptInfo
 	List	   *pathlist;		/* Path structures */
 	List	   *ppilist;		/* ParamPathInfos used in pathlist */
 	List	   *partial_pathlist;	/* partial Paths */
+	List	   *unique_pathlist;	/* unique Paths */
 	struct Path *cheapest_startup_path;
 	struct Path *cheapest_total_path;
 	struct Path *cheapest_unique_path;
@@ -866,6 +868,7 @@ struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanskip;		/* can AM skip duplicate values? */
 	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
@@ -1182,6 +1185,8 @@ typedef struct Path
 
 	List	   *pathkeys;		/* sort ordering of path's output */
 	/* pathkeys is a List of PathKey nodes; see above */
+
+	List	   *uniquekeys;	/* the unique keys, or NIL if none */
 } Path;
 
 /* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 715a24ad29..6796ad8cb7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -27,6 +27,7 @@ extern int	compare_fractional_path_costs(Path *path1, Path *path2,
 										  double fraction);
 extern void set_cheapest(RelOptInfo *parent_rel);
 extern void add_path(RelOptInfo *parent_rel, Path *new_path);
+extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path);
 extern bool add_path_precheck(RelOptInfo *parent_rel,
 							  Cost startup_cost, Cost total_cost,
 							  List *pathkeys, Relids required_outer);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 9217a8d6c6..0cb8030e33 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -215,6 +215,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
 extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
 										   List *sortclauses,
 										   List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+										  List *sortclauses,
+										  List *tlist);
 extern void initialize_mergeclause_eclasses(PlannerInfo *root,
 											RestrictInfo *restrictinfo);
 extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -282,6 +285,11 @@ extern bool relation_has_uniquekeys_for(PlannerInfo *root,
 										RelOptInfo *rel,
 										List *exprs,
 										bool allow_multinulls);
+extern bool query_has_uniquekeys_for(PlannerInfo *root,
+									 List *pathuniquekeys,
+									 bool allow_multinulls);
 extern bool relation_is_onerow(RelOptInfo *rel);
 
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+
 #endif							/* PATHS_H */
-- 
2.27.0

v02-0002-Index-skip-scan.patchapplication/octet-stream; name=v02-0002-Index-skip-scan.patchDownload
From dcc4b94e9e739a55d4b7d4e65a4f594ea7f04456 Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH 4/5] Index skip scan

Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
as part of the IndexOnlyScan, IndexScan and BitmapIndexScan for nbtree.
This patch improves performance of two main types of queries significantly:
- SELECT DISTINCT, SELECT DISTINCT ON
- Regular SELECTs with WHERE-clauses on non-leading index attributes
For example, given an nbtree index on three columns (a,b,c), the following queries
may now be significantly faster:
- SELECT DISTINCT ON (a) * FROM t1
- SELECT * FROM t1 WHERE b=2
- SELECT * FROM t1 WHERE b IN (10,40)
- SELECT DISTINCT ON (a,b) * FROM t1 WHERE c BETWEEN 10 AND 100 ORDER BY a,b,c

Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen. Further enhanced functionality
added by Floris van Nee regarding a more general and performant skip implementation.

[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com

Author: Floris van Nee, Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
 contrib/amcheck/verify_nbtree.c               |    4 +-
 contrib/bloom/blutils.c                       |    3 +
 doc/src/sgml/config.sgml                      |   15 +
 doc/src/sgml/indexam.sgml                     |  121 +-
 doc/src/sgml/indices.sgml                     |   28 +
 src/backend/access/brin/brin.c                |    3 +
 src/backend/access/gin/ginutil.c              |    3 +
 src/backend/access/gist/gist.c                |    3 +
 src/backend/access/hash/hash.c                |    3 +
 src/backend/access/index/indexam.c            |  163 ++
 src/backend/access/nbtree/Makefile            |    1 +
 src/backend/access/nbtree/nbtinsert.c         |    2 +-
 src/backend/access/nbtree/nbtpage.c           |    2 +-
 src/backend/access/nbtree/nbtree.c            |   56 +-
 src/backend/access/nbtree/nbtsearch.c         |  790 ++++------
 src/backend/access/nbtree/nbtskip.c           | 1332 +++++++++++++++++
 src/backend/access/nbtree/nbtsort.c           |    2 +-
 src/backend/access/nbtree/nbtutils.c          |  821 +++++++++-
 src/backend/access/spgist/spgutils.c          |    3 +
 src/backend/commands/explain.c                |   29 +
 src/backend/executor/execScan.c               |   35 +-
 src/backend/executor/nodeBitmapIndexscan.c    |   21 +-
 src/backend/executor/nodeIndexonlyscan.c      |   69 +-
 src/backend/executor/nodeIndexscan.c          |   71 +-
 src/backend/nodes/copyfuncs.c                 |    5 +
 src/backend/nodes/outfuncs.c                  |    6 +
 src/backend/nodes/readfuncs.c                 |    5 +
 src/backend/optimizer/path/costsize.c         |    1 +
 src/backend/optimizer/path/indxpath.c         |   68 +
 src/backend/optimizer/path/pathkeys.c         |   72 +
 src/backend/optimizer/plan/createplan.c       |   38 +-
 src/backend/optimizer/plan/planner.c          |    8 +-
 src/backend/optimizer/util/pathnode.c         |   38 +
 src/backend/optimizer/util/plancat.c          |    3 +
 src/backend/utils/misc/guc.c                  |    9 +
 src/backend/utils/misc/postgresql.conf.sample |    1 +
 src/backend/utils/sort/tuplesort.c            |    4 +-
 src/include/access/amapi.h                    |   19 +
 src/include/access/genam.h                    |   16 +
 src/include/access/nbtree.h                   |  143 +-
 src/include/executor/executor.h               |    4 +
 src/include/nodes/execnodes.h                 |    7 +
 src/include/nodes/pathnodes.h                 |    5 +
 src/include/nodes/plannodes.h                 |    5 +
 src/include/optimizer/cost.h                  |    1 +
 src/include/optimizer/pathnode.h              |    4 +
 src/include/optimizer/paths.h                 |    4 +
 src/interfaces/libpq/encnames.c               |    1 +
 src/interfaces/libpq/wchar.c                  |    1 +
 src/test/regress/expected/select_distinct.out |  599 ++++++++
 src/test/regress/expected/sysviews.out        |    3 +-
 src/test/regress/sql/select_distinct.sql      |  248 +++
 52 files changed, 4354 insertions(+), 544 deletions(-)
 create mode 100644 src/backend/access/nbtree/nbtskip.c
 create mode 120000 src/interfaces/libpq/encnames.c
 create mode 120000 src/interfaces/libpq/wchar.c

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index e4d501a85d..73d7b63b3b 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2508,7 +2508,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 	Buffer		lbuf;
 	bool		exists;
 
-	key = _bt_mkscankey(state->rel, itup);
+	key = _bt_mkscankey(state->rel, itup, NULL);
 	Assert(key->heapkeyspace && key->scantid != NULL);
 
 	/*
@@ -2944,7 +2944,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
 {
 	BTScanInsert skey;
 
-	skey = _bt_mkscankey(rel, itup);
+	skey = _bt_mkscankey(rel, itup, NULL);
 	skey->pivotsearch = true;
 
 	return skey;
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index d3bf8665df..6170e06289 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -134,6 +134,9 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = blcostestimate;
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ca6a3a523f..25d45c81c4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4648,6 +4648,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+      <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of index-skip-scan plan
+        types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+        <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-material" xreflabel="enable_material">
       <term><varname>enable_material</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index af87f172a7..4ec3605b8f 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -150,6 +150,9 @@ typedef struct IndexAmRoutine
     amendscan_function amendscan;
     ammarkpos_function ammarkpos;       /* can be NULL */
     amrestrpos_function amrestrpos;     /* can be NULL */
+    amskip_function amskip;                        /* can be NULL */
+    ambeginscan_skip_function ambeginskipscan;     /* can be NULL */
+    amgettuple_with_skip_function amgetskiptuple;  /* can be NULL */
 
     /* interface functions to support parallel index scans */
     amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
@@ -676,6 +679,122 @@ amrestrpos (IndexScanDesc scan);
    struct may be set to NULL.
   </para>
 
+  <para>
+<programlisting>
+bool
+amskip (IndexScanDesc scan,
+        ScanDirection prefixDir,
+        ScanDirection postfixDir);
+</programlisting>
+  Skip past all tuples where the first 'prefix' columns have the same value as
+  the last tuple returned in the current scan. The arguments are:
+
+   <variablelist>
+    <varlistentry>
+     <term><parameter>scan</parameter></term>
+     <listitem>
+      <para>
+       Index scan information
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><parameter>prefixDir</parameter></term>
+     <listitem>
+      <para>
+       The direction in which the prefix part of the tuple is advancing.
+      </para>
+     </listitem>
+    </varlistentry>
+
+    <varlistentry>
+     <term><parameter>postfixDir</parameter></term>
+     <listitem>
+      <para>
+        The direction in which the postfix (everything after the prefix) of the tuple is advancing.
+      </para>
+     </listitem>
+    </varlistentry>
+   </variablelist>
+
+  </para>
+  <para>
+<programlisting>
+IndexScanDesc
+ambeginscan_skip (Relation indexRelation,
+                  int nkeys,
+                  int norderbys,
+                  int prefix);
+</programlisting>
+   Prepare for an index scan.  The <literal>nkeys</literal> and <literal>norderbys</literal>
+   parameters indicate the number of quals and ordering operators that will be
+   used in the scan; these may be useful for space allocation purposes.
+   Note that the actual values of the scan keys aren't provided yet.
+   The result must be a palloc'd struct.
+   For implementation reasons the index access method
+   <emphasis>must</emphasis> create this struct by calling
+   <function>RelationGetIndexScan()</function>.  In most cases
+   <function>ambeginscan_skip</function> does little beyond making that call and perhaps
+   acquiring locks;
+   the interesting parts of index-scan startup are in <function>amrescan</function>.
+   For a skip scan, <literal>prefix</literal> must indicate the length of the key
+   prefix that can be skipped over.  <literal>prefix</literal> can be set to -1 to
+   disable skipping, which yields a scan identical to one started with a regular
+   call to <function>ambeginscan</function>.
+  </para>
+  <para>
+<programlisting>
+boolean
+amgettuple_skip (IndexScanDesc scan,
+                 ScanDirection prefixDir,
+                 ScanDirection postfixDir);
+</programlisting>
+     Fetch the next tuple in the given scan, moving in the given
+     directions.  The prefix direction applies to the key prefix whose length
+     was specified in the <function>ambeginscan_skip</function> call; the
+     postfix direction applies to the remaining columns.  The two directions
+     can differ in DISTINCT scans when fetching backwards from a cursor.
+     Returns true if a tuple was
+     obtained, false if no matching tuples remain.  In the true case the tuple
+     TID is stored into the <literal>scan</literal> structure.  Note that
+     <quote>success</quote> means only that the index contains an entry that matches
+     the scan keys, not that the tuple necessarily still exists in the heap or
+     will pass the caller's snapshot test.  On success, <function>amgettuple_skip</function>
+     must also set <literal>scan-&gt;xs_recheck</literal> to true or false.
+     False means it is certain that the index entry matches the scan keys.
+     True means this is not certain, and the conditions represented by the
+     scan keys must be rechecked against the heap tuple after fetching it.
+     This provision supports <quote>lossy</quote> index operators.
+     Note that rechecking will extend only to the scan conditions; a partial
+     index predicate (if any) is never rechecked by <function>amgettuple</function>
+     callers.
+    </para>
+
+    <para>
+     If the index supports <link linkend="indexes-index-only-scans">index-only
+     scans</link> (i.e., <function>amcanreturn</function> returns true for it),
+     then on success the AM must also check <literal>scan-&gt;xs_want_itup</literal>,
+     and if that is true it must return the originally indexed data for the
+     index entry.  The data can be returned in the form of an
+     <structname>IndexTuple</structname> pointer stored at <literal>scan-&gt;xs_itup</literal>,
+     with tuple descriptor <literal>scan-&gt;xs_itupdesc</literal>; or in the form of
+     a <structname>HeapTuple</structname> pointer stored at <literal>scan-&gt;xs_hitup</literal>,
+     with tuple descriptor <literal>scan-&gt;xs_hitupdesc</literal>.  (The latter
+     format should be used when reconstructing data that might possibly not fit
+     into an <structname>IndexTuple</structname>.)  In either case,
+     management of the data referenced by the pointer is the access method's
+     responsibility.  The data must remain good at least until the next
+     <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
+     call for the scan.
+    </para>
+
+    <para>
+     The <function>amgettuple_skip</function> function need only be provided if the
+     access method supports skip scans.  If it doesn't, the
+     <structfield>amgetskiptuple</structfield> field in its <structname>IndexAmRoutine</structname>
+     struct must be set to NULL.
+    </para>
+
   <para>
    In addition to supporting ordinary index scans, some types of index
    may wish to support <firstterm>parallel index scans</firstterm>, which allow
@@ -691,7 +810,7 @@ amrestrpos (IndexScanDesc scan);
    functions may be implemented to support parallel index scans:
   </para>
 
-  <para>
+    <para>
 <programlisting>
 Size
 amestimateparallelscan (void);
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 28adaba72d..1eed07adfc 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1281,6 +1281,34 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
    and later will recognize such cases and allow index-only scans to be
    generated, but older versions will not.
   </para>
+
+  <sect2 id="indexes-index-skip-scans">
+    <title>Index Skip Scans</title>
+
+    <indexterm zone="indexes-index-skip-scans">
+      <primary>index</primary>
+      <secondary>index-skip scans</secondary>
+    </indexterm>
+    <indexterm zone="indexes-index-skip-scans">
+      <primary>index-skip scan</primary>
+    </indexterm>
+
+    <para>
+     When the rows retrieved from an index scan are then deduplicated by
+     eliminating rows matching on a prefix of index keys (e.g. when using
+     <literal>SELECT DISTINCT</literal>), the planner will consider
+     skipping groups of rows with a matching key prefix. When a row with
+     a particular prefix is found, remaining rows with the same key prefix
+     are skipped.  The larger the number of rows with the same key prefix
+     rows (i.e. the lower the number of distinct key prefixes in the index),
+     the more efficient this is.
+    </para>
+    <para>
+      Additionally, a skip scan can be considered in regular <literal>SELECT</literal>
+      queries. When filtering on a non-leading attribute of an index, the planner
+      may choose a skip scan.
+    </para>
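+    <para>
+     For example, given a table <literal>t1</literal> with an index on the
+     columns <literal>(a, b, c)</literal>, queries such as the following
+     may be able to use a skip scan:
+<programlisting>
+SELECT DISTINCT ON (a) * FROM t1;
+SELECT * FROM t1 WHERE b = 2;
+SELECT * FROM t1 WHERE b IN (10, 40);
+</programlisting>
+    </para>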
+  </sect2>
  </sect1>
 
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7db3ae5ee0..742fa3f634 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -115,6 +115,9 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = brincostestimate;
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a400f1fedb..9ffd0fb173 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -66,6 +66,9 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = gincostestimate;
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 79fe6eb8d6..36dc5a6c2c 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -87,6 +87,9 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = gistcostestimate;
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 3ec6d528e7..1bedd55b06 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -84,6 +84,9 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = hashcostestimate;
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 6b9750c244..277db46f8a 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -14,7 +14,9 @@
  *		index_open		- open an index relation by relation OID
  *		index_close		- close an index relation
  *		index_beginscan - start a scan of an index with amgettuple
+ *		index_beginscan_skip - start a scan of an index with amgettuple and skipping
  *		index_beginscan_bitmap - start a scan of an index with amgetbitmap
+ *		index_beginscan_bitmap_skip - start a skip scan of an index with amgetbitmap
  *		index_rescan	- restart a scan of an index
  *		index_endscan	- end a scan
  *		index_insert	- insert an index tuple into a relation
@@ -25,14 +27,17 @@
  *		index_parallelrescan  - (re)start a parallel scan of an index
  *		index_beginscan_parallel - join parallel index scan
  *		index_getnext_tid	- get the next TID from a scan
+ *		index_getnext_tid_skip	- get the next TID from a skip scan
  *		index_fetch_heap		- get the scan's next heap tuple
  *		index_getnext_slot	- get the next tuple from a scan
+ *		index_getnext_slot_skip	- get the next tuple from a skip scan
  *		index_getbitmap - get all tuples from a scan
  *		index_bulk_delete	- bulk deletion of index tuples
  *		index_vacuum_cleanup	- post-deletion cleanup of an index
  *		index_can_return	- does index support index-only scans?
  *		index_getprocid - get a support procedure OID
  *		index_getprocinfo - get a support procedure's lookup info
+ *		index_skip		- advance past duplicate key values in a scan
  *
  * NOTES
  *		This file contains the index_ routines which used
@@ -222,6 +227,78 @@ index_beginscan(Relation heapRelation,
 	return scan;
 }
 
+static IndexScanDesc
+index_beginscan_internal_skip(Relation indexRelation,
+						 int nkeys, int norderbys, int prefix, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap)
+{
+	IndexScanDesc scan;
+
+	RELATION_CHECKS;
+	CHECK_REL_PROCEDURE(ambeginskipscan);
+
+	if (!(indexRelation->rd_indam->ampredlocks))
+		PredicateLockRelation(indexRelation, snapshot);
+
+	/*
+	 * We hold a reference count to the relcache entry throughout the scan.
+	 */
+	RelationIncrementReferenceCount(indexRelation);
+
+	/*
+	 * Tell the AM to open a scan.
+	 */
+	scan = indexRelation->rd_indam->ambeginskipscan(indexRelation, nkeys,
+												norderbys, prefix);
+	/* Initialize information for parallel scan. */
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
+
+	return scan;
+}
+
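+/*
+ * index_beginscan_skip - start a scan of an index with amgettuple,
+ *		skipping over a key prefix of the given length
+ */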
+IndexScanDesc
+index_beginscan_skip(Relation heapRelation,
+				Relation indexRelation,
+				Snapshot snapshot,
+				int nkeys, int norderbys, int prefix)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal_skip(indexRelation, nkeys, norderbys, prefix, snapshot, NULL, false);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan.
+	 */
+	scan->heapRelation = heapRelation;
+	scan->xs_snapshot = snapshot;
+
+	/* prepare to fetch index matches from table */
+	scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+
+	return scan;
+}
+
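+/*
+ * index_beginscan_bitmap_skip - start a skip scan of an index with amgetbitmap
+ */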
+IndexScanDesc
+index_beginscan_bitmap_skip(Relation indexRelation,
+					   Snapshot snapshot,
+					   int nkeys,
+					   int prefix)
+{
+	IndexScanDesc scan;
+
+	scan = index_beginscan_internal_skip(indexRelation, nkeys, 0, prefix, snapshot, NULL, false);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan.
+	 */
+	scan->xs_snapshot = snapshot;
+
+	return scan;
+}
+
 /*
  * index_beginscan_bitmap - start a scan of an index with amgetbitmap
  *
@@ -550,6 +627,45 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
 	return &scan->xs_heaptid;
 }
 
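+/* ----------------
+ *		index_getnext_tid_skip - get the next TID from a skip scan
+ *
+ *		Like index_getnext_tid, but with separate prefix and postfix
+ *		scan directions.
+ * ----------------
+ */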
+ItemPointer
+index_getnext_tid_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	bool		found;
+
+	SCAN_CHECKS;
+	CHECK_SCAN_PROCEDURE(amgetskiptuple);
+
+	Assert(TransactionIdIsValid(RecentGlobalXmin));
+
+	/*
+	 * The AM's amgetskiptuple proc finds the next index entry matching the scan
+	 * keys, and puts the TID into scan->xs_heaptid.  It should also set
+	 * scan->xs_recheck and possibly scan->xs_itup/scan->xs_hitup, though we
+	 * pay no attention to those fields here.
+	 */
+	found = scan->indexRelation->rd_indam->amgetskiptuple(scan, prefixDir, postfixDir);
+
+	/* Reset kill flag immediately for safety */
+	scan->kill_prior_tuple = false;
+	scan->xs_heap_continue = false;
+
+	/* If we're out of index entries, we're done */
+	if (!found)
+	{
+		/* release resources (like buffer pins) from table accesses */
+		if (scan->xs_heapfetch)
+			table_index_fetch_reset(scan->xs_heapfetch);
+
+		return NULL;
+	}
+	Assert(ItemPointerIsValid(&scan->xs_heaptid));
+
+	pgstat_count_index_tuples(scan->indexRelation, 1);
+
+	/* Return the TID of the tuple we found. */
+	return &scan->xs_heaptid;
+}
+
 /* ----------------
  *		index_fetch_heap - get the scan's next heap tuple
  *
@@ -641,6 +757,38 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
 	return false;
 }
 
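+/* ----------------
+ *		index_getnext_slot_skip - get the next tuple from a skip scan
+ * ----------------
+ */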
+bool
+index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir, TupleTableSlot *slot)
+{
+	for (;;)
+	{
+		if (!scan->xs_heap_continue)
+		{
+			ItemPointer tid;
+
+			/* Time to fetch the next TID from the index */
+			tid = index_getnext_tid_skip(scan, prefixDir, postfixDir);
+
+			/* If we're out of index entries, we're done */
+			if (tid == NULL)
+				break;
+
+			Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+		}
+
+		/*
+		 * Fetch the next (or only) visible heap tuple for this index entry.
+		 * If we don't find anything, loop around and grab the next TID from
+		 * the index.
+		 */
+		Assert(ItemPointerIsValid(&scan->xs_heaptid));
+		if (index_fetch_heap(scan, slot))
+			return true;
+	}
+
+	return false;
+}
+
 /* ----------------
  *		index_getbitmap - get all tuples at once from an index scan
  *
@@ -736,6 +884,21 @@ index_can_return(Relation indexRelation, int attno)
 	return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
 }
 
+/* ----------------
+ *		index_skip
+ *
+ *		Skip past all tuples where the first 'prefix' columns have the
+ *		same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	SCAN_CHECKS;
+
+	return scan->indexRelation->rd_indam->amskip(scan, prefixDir, postfixDir);
+}
+
 /* ----------------
  *		index_getprocid
  *
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index d69808e78c..da96ac00a6 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -19,6 +19,7 @@ OBJS = \
 	nbtpage.o \
 	nbtree.o \
 	nbtsearch.o \
+	nbtskip.o \
 	nbtsort.o \
 	nbtsplitloc.o \
 	nbtutils.o \
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index e3a44bc09e..165847c290 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -89,7 +89,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	bool		checkingunique = (checkUnique != UNIQUE_CHECK_NO);
 
 	/* we need an insertion scan key to do our search, so build one */
-	itup_key = _bt_mkscankey(rel, itup);
+	itup_key = _bt_mkscankey(rel, itup, NULL);
 
 	if (checkingunique)
 	{
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 70bac0052f..ac34bec460 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1701,7 +1701,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
 				}
 
 				/* we need an insertion scan key for the search, so build one */
-				itup_key = _bt_mkscankey(rel, targetkey);
+				itup_key = _bt_mkscankey(rel, targetkey, NULL);
 				/* find the leftmost leaf page with matching pivot/high key */
 				itup_key->pivotsearch = true;
 				stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index d65f4357cc..9d19eab5f0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -136,14 +136,17 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
+	amroutine->amskip = btskip;
 	amroutine->amcostestimate = btcostestimate;
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->ambuildphasename = btbuildphasename;
 	amroutine->amvalidate = btvalidate;
 	amroutine->ambeginscan = btbeginscan;
+	amroutine->ambeginskipscan = btbeginscan_skip;
 	amroutine->amrescan = btrescan;
 	amroutine->amgettuple = btgettuple;
+	amroutine->amgetskiptuple = btgettuple_skip;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
@@ -219,6 +222,15 @@ btinsert(Relation rel, Datum *values, bool *isnull,
  */
 bool
 btgettuple(IndexScanDesc scan, ScanDirection dir)
+{
+	return btgettuple_skip(scan, dir, dir);
+}
+
+/*
+ *	btgettuple_skip() -- Get the next tuple in the scan, with separate
+ *		prefix and postfix scan directions.
+ */
+bool
+btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	bool		res;
@@ -237,7 +249,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		if (so->numArrayKeys < 0)
 			return false;
 
-		_bt_start_array_keys(scan, dir);
+		_bt_start_array_keys(scan, prefixDir);
 	}
 
 	/* This loop handles advancing to the next array elements, if any */
@@ -249,7 +261,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 		 * _bt_first() to get the first item in the scan.
 		 */
 		if (!BTScanPosIsValid(so->currPos))
-			res = _bt_first(scan, dir);
+			res = _bt_first(scan, prefixDir, postfixDir);
 		else
 		{
 			/*
@@ -276,14 +288,14 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 			/*
 			 * Now continue the scan.
 			 */
-			res = _bt_next(scan, dir);
+			res = _bt_next(scan, prefixDir, postfixDir);
 		}
 
 		/* If we have a tuple, return it ... */
 		if (res)
 			break;
 		/* ... otherwise see if we have more array keys to deal with */
-	} while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+	} while (so->numArrayKeys && _bt_advance_array_keys(scan, prefixDir));
 
 	return res;
 }
@@ -314,7 +326,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 	do
 	{
 		/* Fetch the first page & tuple */
-		if (_bt_first(scan, ForwardScanDirection))
+		if (_bt_first(scan, ForwardScanDirection, ForwardScanDirection))
 		{
 			/* Save tuple ID, and continue scanning */
 			heapTid = &scan->xs_heaptid;
@@ -330,7 +342,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
 				if (++so->currPos.itemIndex > so->currPos.lastItem)
 				{
 					/* let _bt_next do the heavy lifting */
-					if (!_bt_next(scan, ForwardScanDirection))
+					if (!_bt_next(scan, ForwardScanDirection, ForwardScanDirection))
 						break;
 				}
 
@@ -351,6 +363,16 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
  */
 IndexScanDesc
 btbeginscan(Relation rel, int nkeys, int norderbys)
+{
+	return btbeginscan_skip(rel, nkeys, norderbys, -1);
+}
+
+/*
+ *	btbeginscan_skip() -- start a scan on a btree index, optionally with
+ *		skipping over a key prefix of length skipPrefix
+ */
+IndexScanDesc
+btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix)
 {
 	IndexScanDesc scan;
 	BTScanOpaque so;
@@ -385,10 +407,18 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	 */
 	so->currTuples = so->markTuples = NULL;
 
+	so->skipData = NULL;
+
 	scan->xs_itupdesc = RelationGetDescr(rel);
 
 	scan->opaque = so;
 
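+	/*
+	 * Set up skip-scan state only when a usable prefix length was given.
+	 * Plain btbeginscan() passes skipPrefix = -1, in which case the scan
+	 * behaves exactly like a regular index scan.
+	 */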
+	if (skipPrefix > 0)
+	{
+		so->skipData = (BTSkip) palloc0(sizeof(BTSkipData));
+		so->skipData->prefix = skipPrefix;
+	}
+
 	return scan;
 }
 
@@ -452,6 +482,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	_bt_preprocess_array_keys(scan);
 }
 
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	return _bt_skip(scan, prefixDir, postfixDir);
+}
+
 /*
  *	btendscan() -- close down a scan
  */
@@ -485,6 +524,8 @@ btendscan(IndexScanDesc scan)
 	if (so->currTuples != NULL)
 		pfree(so->currTuples);
 	/* so->markTuples should not be pfree'd, see btrescan */
+	if (_bt_skip_enabled(so))
+		pfree(so->skipData);
 	pfree(so);
 }
 
@@ -568,6 +609,9 @@ btrestrpos(IndexScanDesc scan)
 			if (so->currTuples)
 				memcpy(so->currTuples, so->markTuples,
 					   so->markPos.nextTupleOffset);
+			if (so->skipData)
+				memcpy(&so->skipData->curPos, &so->skipData->markPos,
+					   sizeof(BTSkipPosData));
 		}
 		else
 			BTScanPosInvalidate(so->currPos);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 28dc196b55..b5bbc5bfa2 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -17,19 +17,17 @@
 
 #include "access/nbtree.h"
 #include "access/relscan.h"
+#include "catalog/catalog.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/predicate.h"
+#include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
 
 
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
-						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
 static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -38,14 +36,12 @@ static int	_bt_setuppostingitems(BTScanOpaque so, int itemIndex,
 static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
 									   OffsetNumber offnum,
 									   ItemPointer heapTid, int tupleOffset);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
 								  ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline bool _bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+										  ScanDirection dir, bool isRegularMode,
+										  bool *continuescan, int *prefixskipindex);
 
 /*
  *	_bt_drop_lock_and_maybe_pin()
@@ -61,7 +57,7 @@ static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
  * will remain in shared memory for as long as it takes to scan the index
  * buffer page.
  */
-static void
+void
 _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
 {
 	_bt_unlockbuf(scan->indexRelation, sp->buf);
@@ -339,7 +335,7 @@ _bt_moveright(Relation rel,
  * the given page.  _bt_binsrch() has no lock or refcount side effects
  * on the buffer.
  */
-static OffsetNumber
+OffsetNumber
 _bt_binsrch(Relation rel,
 			BTScanInsert key,
 			Buffer buf)
@@ -845,25 +841,23 @@ _bt_compare(Relation rel,
  * in locating the scan start position.
  */
 bool
-_bt_first(IndexScanDesc scan, ScanDirection dir)
+_bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
 {
 	Relation	rel = scan->indexRelation;
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Buffer		buf;
 	BTStack		stack;
 	OffsetNumber offnum;
-	StrategyNumber strat;
-	bool		nextkey;
 	bool		goback;
 	BTScanInsertData inskey;
 	ScanKey		startKeys[INDEX_MAX_KEYS];
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
-	int			i;
 	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
 	BlockNumber blkno;
+	IndexTuple itup;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -900,184 +894,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 		else if (blkno != InvalidBlockNumber)
 		{
-			if (!_bt_parallel_readpage(scan, blkno, dir))
+			if (!_bt_parallel_readpage(scan, blkno, prefixDir))
 				return false;
 			goto readcomplete;
 		}
 	}
 
-	/*----------
-	 * Examine the scan keys to discover where we need to start the scan.
-	 *
-	 * We want to identify the keys that can be used as starting boundaries;
-	 * these are =, >, or >= keys for a forward scan or =, <, <= keys for
-	 * a backwards scan.  We can use keys for multiple attributes so long as
-	 * the prior attributes had only =, >= (resp. =, <=) keys.  Once we accept
-	 * a > or < boundary or find an attribute with no boundary (which can be
-	 * thought of as the same as "> -infinity"), we can't use keys for any
-	 * attributes to its right, because it would break our simplistic notion
-	 * of what initial positioning strategy to use.
-	 *
-	 * When the scan keys include cross-type operators, _bt_preprocess_keys
-	 * may not be able to eliminate redundant keys; in such cases we will
-	 * arbitrarily pick a usable one for each attribute.  This is correct
-	 * but possibly not optimal behavior.  (For example, with keys like
-	 * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
-	 * x=5 would be more efficient.)  Since the situation only arises given
-	 * a poorly-worded query plus an incomplete opfamily, live with it.
-	 *
-	 * When both equality and inequality keys appear for a single attribute
-	 * (again, only possible when cross-type operators appear), we *must*
-	 * select one of the equality keys for the starting point, because
-	 * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
-	 * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
-	 * start at x=4, we will fail and stop before reaching x=10.  If multiple
-	 * equality quals survive preprocessing, however, it doesn't matter which
-	 * one we use --- by definition, they are either redundant or
-	 * contradictory.
-	 *
-	 * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
-	 * If the index stores nulls at the end of the index we'll be starting
-	 * from, and we have no boundary key for the column (which means the key
-	 * we deduced NOT NULL from is an inequality key that constrains the other
-	 * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
-	 * use as a boundary key.  If we didn't do this, we might find ourselves
-	 * traversing a lot of null entries at the start of the scan.
-	 *
-	 * In this loop, row-comparison keys are treated the same as keys on their
-	 * first (leftmost) columns.  We'll add on lower-order columns of the row
-	 * comparison below, if possible.
-	 *
-	 * The selected scan keys (at most one per index column) are remembered by
-	 * storing their addresses into the local startKeys[] array.
-	 *----------
-	 */
-	strat_total = BTEqualStrategyNumber;
-	if (so->numberOfKeys > 0)
-	{
-		AttrNumber	curattr;
-		ScanKey		chosen;
-		ScanKey		impliesNN;
-		ScanKey		cur;
-
-		/*
-		 * chosen is the so-far-chosen key for the current attribute, if any.
-		 * We don't cast the decision in stone until we reach keys for the
-		 * next attribute.
-		 */
-		curattr = 1;
-		chosen = NULL;
-		/* Also remember any scankey that implies a NOT NULL constraint */
-		impliesNN = NULL;
-
-		/*
-		 * Loop iterates from 0 to numberOfKeys inclusive; we use the last
-		 * pass to handle after-last-key processing.  Actual exit from the
-		 * loop is at one of the "break" statements below.
-		 */
-		for (cur = so->keyData, i = 0;; cur++, i++)
-		{
-			if (i >= so->numberOfKeys || cur->sk_attno != curattr)
-			{
-				/*
-				 * Done looking at keys for curattr.  If we didn't find a
-				 * usable boundary key, see if we can deduce a NOT NULL key.
-				 */
-				if (chosen == NULL && impliesNN != NULL &&
-					((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
-					 ScanDirectionIsForward(dir) :
-					 ScanDirectionIsBackward(dir)))
-				{
-					/* Yes, so build the key in notnullkeys[keysCount] */
-					chosen = &notnullkeys[keysCount];
-					ScanKeyEntryInitialize(chosen,
-										   (SK_SEARCHNOTNULL | SK_ISNULL |
-											(impliesNN->sk_flags &
-											 (SK_BT_DESC | SK_BT_NULLS_FIRST))),
-										   curattr,
-										   ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
-											BTGreaterStrategyNumber :
-											BTLessStrategyNumber),
-										   InvalidOid,
-										   InvalidOid,
-										   InvalidOid,
-										   (Datum) 0);
-				}
-
-				/*
-				 * If we still didn't find a usable boundary key, quit; else
-				 * save the boundary key pointer in startKeys.
-				 */
-				if (chosen == NULL)
-					break;
-				startKeys[keysCount++] = chosen;
-
-				/*
-				 * Adjust strat_total, and quit if we have stored a > or <
-				 * key.
-				 */
-				strat = chosen->sk_strategy;
-				if (strat != BTEqualStrategyNumber)
-				{
-					strat_total = strat;
-					if (strat == BTGreaterStrategyNumber ||
-						strat == BTLessStrategyNumber)
-						break;
-				}
-
-				/*
-				 * Done if that was the last attribute, or if next key is not
-				 * in sequence (implying no boundary key is available for the
-				 * next attribute).
-				 */
-				if (i >= so->numberOfKeys ||
-					cur->sk_attno != curattr + 1)
-					break;
-
-				/*
-				 * Reset for next attr.
-				 */
-				curattr = cur->sk_attno;
-				chosen = NULL;
-				impliesNN = NULL;
-			}
-
-			/*
-			 * Can we use this key as a starting boundary for this attr?
-			 *
-			 * If not, does it imply a NOT NULL constraint?  (Because
-			 * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
-			 * *any* inequality key works for that; we need not test.)
-			 */
-			switch (cur->sk_strategy)
-			{
-				case BTLessStrategyNumber:
-				case BTLessEqualStrategyNumber:
-					if (chosen == NULL)
-					{
-						if (ScanDirectionIsBackward(dir))
-							chosen = cur;
-						else
-							impliesNN = cur;
-					}
-					break;
-				case BTEqualStrategyNumber:
-					/* override any non-equality choice */
-					chosen = cur;
-					break;
-				case BTGreaterEqualStrategyNumber:
-				case BTGreaterStrategyNumber:
-					if (chosen == NULL)
-					{
-						if (ScanDirectionIsForward(dir))
-							chosen = cur;
-						else
-							impliesNN = cur;
-					}
-					break;
-			}
-		}
-	}
+	keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, prefixDir, startKeys, notnullkeys, &strat_total, 0);
 
 	/*
 	 * If we found no usable boundary keys, we have to start from one end of
@@ -1088,260 +911,112 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	{
 		bool		match;
 
-		match = _bt_endpoint(scan, dir);
-
-		if (!match)
+		if (!_bt_skip_enabled(so))
 		{
-			/* No match, so mark (parallel) scan finished */
-			_bt_parallel_done(scan);
-		}
+			match = _bt_endpoint(scan, prefixDir);
 
-		return match;
-	}
+			if (!match)
+			{
+				/* No match, so mark (parallel) scan finished */
+				_bt_parallel_done(scan);
+			}
 
-	/*
-	 * We want to start the scan somewhere within the index.  Set up an
-	 * insertion scankey we can use to search for the boundary point we
-	 * identified above.  The insertion scankey is built using the keys
-	 * identified by startKeys[].  (Remaining insertion scankey fields are
-	 * initialized after initial-positioning strategy is finalized.)
-	 */
-	Assert(keysCount <= INDEX_MAX_KEYS);
-	for (i = 0; i < keysCount; i++)
-	{
-		ScanKey		cur = startKeys[i];
+			return match;
+		}
+		else
+		{
+			Relation	rel = scan->indexRelation;
+			Buffer		buf;
+			Page		page;
+			BTPageOpaque opaque;
+			OffsetNumber start;
+			BTSkipCompareResult cmp = {0};
 
-		Assert(cur->sk_attno == i + 1);
+			_bt_skip_create_scankeys(rel, so);
 
-		if (cur->sk_flags & SK_ROW_HEADER)
-		{
 			/*
-			 * Row comparison header: look to the first row member instead.
-			 *
-			 * The member scankeys are already in insertion format (ie, they
-			 * have sk_func = 3-way-comparison function), but we have to watch
-			 * out for nulls, which _bt_preprocess_keys didn't check. A null
-			 * in the first row member makes the condition unmatchable, just
-			 * like qual_ok = false.
+			 * Scan down to the leftmost or rightmost leaf page and position
+			 * the scan on the leftmost or rightmost item on that page.
+			 * Start the skip scan from there to find the first matching item
 			 */
-			ScanKey		subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+			buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(prefixDir), scan->xs_snapshot);
 
-			Assert(subkey->sk_flags & SK_ROW_MEMBER);
-			if (subkey->sk_flags & SK_ISNULL)
+			if (!BufferIsValid(buf))
 			{
-				_bt_parallel_done(scan);
+				/*
+				 * Empty index. Lock the whole relation, as nothing finer to lock
+				 * exists.
+				 */
+				PredicateLockRelation(rel, scan->xs_snapshot);
+				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
-			memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
 
-			/*
-			 * If the row comparison is the last positioning key we accepted,
-			 * try to add additional keys from the lower-order row members.
-			 * (If we accepted independent conditions on additional index
-			 * columns, we use those instead --- doesn't seem worth trying to
-			 * determine which is more restrictive.)  Note that this is OK
-			 * even if the row comparison is of ">" or "<" type, because the
-			 * condition applied to all but the last row member is effectively
-			 * ">=" or "<=", and so the extra keys don't break the positioning
-			 * scheme.  But, by the same token, if we aren't able to use all
-			 * the row members, then the part of the row comparison that we
-			 * did use has to be treated as just a ">=" or "<=" condition, and
-			 * so we'd better adjust strat_total accordingly.
-			 */
-			if (i == keysCount - 1)
+			PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
+			page = BufferGetPage(buf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			Assert(P_ISLEAF(opaque));
+
+			if (ScanDirectionIsForward(prefixDir))
 			{
-				bool		used_all_subkeys = false;
+				/* There could be dead pages to the left, so not this: */
+				/* Assert(P_LEFTMOST(opaque)); */
 
-				Assert(!(subkey->sk_flags & SK_ROW_END));
-				for (;;)
-				{
-					subkey++;
-					Assert(subkey->sk_flags & SK_ROW_MEMBER);
-					if (subkey->sk_attno != keysCount + 1)
-						break;	/* out-of-sequence, can't use it */
-					if (subkey->sk_strategy != cur->sk_strategy)
-						break;	/* wrong direction, can't use it */
-					if (subkey->sk_flags & SK_ISNULL)
-						break;	/* can't use null keys */
-					Assert(keysCount < INDEX_MAX_KEYS);
-					memcpy(inskey.scankeys + keysCount, subkey,
-						   sizeof(ScanKeyData));
-					keysCount++;
-					if (subkey->sk_flags & SK_ROW_END)
-					{
-						used_all_subkeys = true;
-						break;
-					}
-				}
-				if (!used_all_subkeys)
-				{
-					switch (strat_total)
-					{
-						case BTLessStrategyNumber:
-							strat_total = BTLessEqualStrategyNumber;
-							break;
-						case BTGreaterStrategyNumber:
-							strat_total = BTGreaterEqualStrategyNumber;
-							break;
-					}
-				}
-				break;			/* done with outer loop */
+				start = P_FIRSTDATAKEY(opaque);
 			}
-		}
-		else
-		{
-			/*
-			 * Ordinary comparison key.  Transform the search-style scan key
-			 * to an insertion scan key by replacing the sk_func with the
-			 * appropriate btree comparison function.
-			 *
-			 * If scankey operator is not a cross-type comparison, we can use
-			 * the cached comparison function; otherwise gotta look it up in
-			 * the catalogs.  (That can't lead to infinite recursion, since no
-			 * indexscan initiated by syscache lookup will use cross-data-type
-			 * operators.)
-			 *
-			 * We support the convention that sk_subtype == InvalidOid means
-			 * the opclass input type; this is a hack to simplify life for
-			 * ScanKeyInit().
-			 */
-			if (cur->sk_subtype == rel->rd_opcintype[i] ||
-				cur->sk_subtype == InvalidOid)
+			else if (ScanDirectionIsBackward(prefixDir))
 			{
-				FmgrInfo   *procinfo;
-
-				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
-				ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
-											   cur->sk_flags,
-											   cur->sk_attno,
-											   InvalidStrategy,
-											   cur->sk_subtype,
-											   cur->sk_collation,
-											   procinfo,
-											   cur->sk_argument);
+				Assert(P_RIGHTMOST(opaque));
+
+				start = PageGetMaxOffsetNumber(page);
 			}
 			else
 			{
-				RegProcedure cmp_proc;
-
-				cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
-											 rel->rd_opcintype[i],
-											 cur->sk_subtype,
-											 BTORDER_PROC);
-				if (!RegProcedureIsValid(cmp_proc))
-					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
-						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
-						 cur->sk_attno, RelationGetRelationName(rel));
-				ScanKeyEntryInitialize(inskey.scankeys + i,
-									   cur->sk_flags,
-									   cur->sk_attno,
-									   InvalidStrategy,
-									   cur->sk_subtype,
-									   cur->sk_collation,
-									   cmp_proc,
-									   cur->sk_argument);
+				elog(ERROR, "invalid scan direction: %d", (int) prefixDir);
 			}
-		}
-	}
 
-	/*----------
-	 * Examine the selected initial-positioning strategy to determine exactly
-	 * where we need to start the scan, and set flag variables to control the
-	 * code below.
-	 *
-	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
-	 * item >= scan key.  If nextkey = true, they will locate the first
-	 * item > scan key.
-	 *
-	 * If goback = true, we will then step back one item, while if
-	 * goback = false, we will start the scan on the located item.
-	 *----------
-	 */
-	switch (strat_total)
-	{
-		case BTLessStrategyNumber:
-
-			/*
-			 * Find first item >= scankey, then back up one to arrive at last
-			 * item < scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = false;
-			goback = true;
-			break;
-
-		case BTLessEqualStrategyNumber:
-
-			/*
-			 * Find first item > scankey, then back up one to arrive at last
-			 * item <= scankey.  (Note: this positioning strategy is only used
-			 * for a backward scan, so that is always the correct starting
-			 * position.)
-			 */
-			nextkey = true;
-			goback = true;
-			break;
-
-		case BTEqualStrategyNumber:
-
-			/*
-			 * If a backward scan was specified, need to start with last equal
-			 * item not first one.
+			/* remember which buffer we have pinned */
+			so->currPos.buf = buf;
+			so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+			itup = _bt_get_tuple_from_offset(so, start);
+			/*
+			 * In some cases we can (or have to) skip further inside the
+			 * prefix.  We can do so if extra quals become available, e.g.
+			 * WHERE b=2 on an index on (a,b).  We must do so if this is
+			 * not regular mode (prefixDir != postfixDir), because then we
+			 * are at the end of the prefix while we should be at the
+			 * beginning.
+			 */
-			if (ScanDirectionIsBackward(dir))
+			if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, 0) ||
+					!_bt_skip_is_regular_mode(prefixDir, postfixDir))
 			{
-				/*
-				 * This is the same as the <= strategy.  We will check at the
-				 * end whether the found item is actually =.
-				 */
-				nextkey = true;
-				goback = true;
+				_bt_skip_extra_conditions(scan, &itup, &start, prefixDir, postfixDir, &cmp);
 			}
-			else
+			/* now find the next matching tuple */
+			match = _bt_skip_find_next(scan, itup, start, prefixDir, postfixDir);
+			if (!match)
 			{
-				/*
-				 * This is the same as the >= strategy.  We will check at the
-				 * end whether the found item is actually =.
-				 */
-				nextkey = false;
-				goback = false;
+				if (_bt_skip_is_always_valid(so))
+					_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+				return false;
 			}
-			break;
 
-		case BTGreaterEqualStrategyNumber:
+			_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
-			/*
-			 * Find first item >= scankey.  (This is only used for forward
-			 * scans.)
-			 */
-			nextkey = false;
-			goback = false;
-			break;
-
-		case BTGreaterStrategyNumber:
-
-			/*
-			 * Find first item > scankey.  (This is only used for forward
-			 * scans.)
-			 */
-			nextkey = true;
-			goback = false;
-			break;
+			currItem = &so->currPos.items[so->currPos.itemIndex];
+			scan->xs_heaptid = currItem->heapTid;
+			if (scan->xs_want_itup)
+				scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
 
-		default:
-			/* can't get here, but keep compiler quiet */
-			elog(ERROR, "unrecognized strat_total: %d", (int) strat_total);
-			return false;
+			return true;
+		}
 	}
 
-	/* Initialize remaining insertion scan key fields */
-	_bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
-	inskey.anynullkeys = false; /* unused */
-	inskey.nextkey = nextkey;
-	inskey.pivotsearch = false;
-	inskey.scantid = NULL;
-	inskey.keysz = keysCount;
+	if (!_bt_create_insertion_scan_key(rel, prefixDir, startKeys, keysCount, &inskey, &strat_total,  &goback))
+	{
+		_bt_parallel_done(scan);
+		return false;
+	}
 
 	/*
 	 * Use the manufactured insertion scan key to descend the tree and
@@ -1373,7 +1048,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	_bt_initialize_more_data(so, dir);
+	_bt_initialize_more_data(so, prefixDir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, &inskey, buf);
@@ -1403,23 +1078,81 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	Assert(!BTScanPosIsValid(so->currPos));
 	so->currPos.buf = buf;
 
-	/*
-	 * Now load data from the first page of the scan.
-	 */
-	if (!_bt_readpage(scan, dir, offnum))
+	if (_bt_skip_enabled(so))
 	{
-		/*
-		 * There's no actually-matching data on this page.  Try to advance to
-		 * the next page.  Return false if there's no matching data at all.
+		Page page;
+		BTPageOpaque opaque;
+		OffsetNumber minoff;
+		bool match;
+		BTSkipCompareResult cmp = {0};
+
+		/* first create the skip scan keys */
+		_bt_skip_create_scankeys(rel, so);
+
+		/* remember which page we have pinned */
+		so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+		page = BufferGetPage(so->currPos.buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		minoff = P_FIRSTDATAKEY(opaque);
+		/* _bt_binsrch combined with the goback parameter can leave offnum before
+		 * the first item on the page or after the last one. If that is the case,
+		 * we need to step one page backward or forward, respectively.
 		 */
-		_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
-		if (!_bt_steppage(scan, dir))
+		if (offnum < minoff)
+		{
+			_bt_unlockbuf(rel, so->currPos.buf);
+			if (!_bt_step_back_page(scan, &itup, &offnum))
+				return false;
+			page = BufferGetPage(so->currPos.buf);
+		}
+		else if (offnum > PageGetMaxOffsetNumber(page))
+		{
+			BlockNumber next = opaque->btpo_next;
+			_bt_unlockbuf(rel, so->currPos.buf);
+			if (!_bt_step_forward_page(scan, next, &itup, &offnum))
+				return false;
+			page = BufferGetPage(so->currPos.buf);
+		}
+
+		itup = _bt_get_tuple_from_offset(so, offnum);
+		/* check if we can skip even more because we can use new conditions */
+		if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, inskey.keysz) ||
+				!_bt_skip_is_regular_mode(prefixDir, postfixDir))
+		{
+			_bt_skip_extra_conditions(scan, &itup, &offnum, prefixDir, postfixDir, &cmp);
+		}
+		/* now find the tuple */
+		match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+		if (!match)
+		{
+			if (_bt_skip_is_always_valid(so))
+				_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 			return false;
+		}
+
+		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 	else
 	{
-		/* Drop the lock, and maybe the pin, on the current page */
-		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+		/*
+		 * Now load data from the first page of the scan.
+		 */
+		if (!_bt_readpage(scan, prefixDir, &offnum, true))
+		{
+			/*
+			 * There's no actually-matching data on this page.  Try to advance to
+			 * the next page.  Return false if there's no matching data at all.
+			 */
+			_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+			if (!_bt_steppage(scan, prefixDir))
+				return false;
+		}
+		else
+		{
+			/* Drop the lock, and maybe the pin, on the current page */
+			_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+		}
 	}
 
 readcomplete:
@@ -1447,29 +1180,113 @@ readcomplete:
  *		so->currPos.buf to InvalidBuffer.
  */
 bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
+_bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	BTScanPosItem *currItem;
 
-	/*
-	 * Advance to next tuple on current page; or if there's no more, try to
-	 * step to the next page with data.
-	 */
-	if (ScanDirectionIsForward(dir))
+	if (!_bt_skip_enabled(so))
 	{
-		if (++so->currPos.itemIndex > so->currPos.lastItem)
+		/*
+		 * Advance to next tuple on current page; or if there's no more, try to
+		 * step to the next page with data.
+		 */
+		if (ScanDirectionIsForward(prefixDir))
 		{
-			if (!_bt_steppage(scan, dir))
-				return false;
+			if (++so->currPos.itemIndex > so->currPos.lastItem)
+			{
+				if (!_bt_steppage(scan, prefixDir))
+					return false;
+			}
+		}
+		else
+		{
+			if (--so->currPos.itemIndex < so->currPos.firstItem)
+			{
+				if (!_bt_steppage(scan, prefixDir))
+					return false;
+			}
 		}
 	}
 	else
 	{
-		if (--so->currPos.itemIndex < so->currPos.firstItem)
+		bool match;
+		IndexTuple itup = NULL;
+		OffsetNumber offnum = InvalidOffsetNumber;
+
+		if (ScanDirectionIsForward(postfixDir))
 		{
-			if (!_bt_steppage(scan, dir))
-				return false;
+			if (++so->currPos.itemIndex > so->currPos.lastItem)
+			{
+				if (prefixDir != so->skipData->curPos.nextDirection)
+				{
+					/* this happens when doing a cursor scan and changing
+					/* This happens when a cursor scan changes direction
+					 * mid-scan, e.g. first fetching forwards, then
+					 * backwards. We *always* just step to the next page
+					 * instead of skipping, because that is the only safe
+					 * option.
+					so->skipData->curPos.nextAction = SkipStateNext;
+					so->skipData->curPos.nextDirection = prefixDir;
+				}
+
+				if (so->skipData->curPos.nextAction == SkipStateNext)
+				{
+					/* we should just go forwards one page, no skipping is necessary */
+					if (!_bt_step_forward_page(scan, so->currPos.nextPage, &itup, &offnum))
+						return false;
+				}
+				else if (so->skipData->curPos.nextAction == SkipStateStop)
+				{
+					/* we've reached the end of the index, or we cannot find any more keys */
+					BTScanPosUnpinIfPinned(so->currPos);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+
+				/* now find the next tuple */
+				match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+				if (!match)
+				{
+					if (_bt_skip_is_always_valid(so))
+						_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+					return false;
+				}
+				_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+			}
+		}
+		else
+		{
+			if (--so->currPos.itemIndex < so->currPos.firstItem)
+			{
+				if (prefixDir != so->skipData->curPos.nextDirection)
+				{
+					so->skipData->curPos.nextAction = SkipStateNext;
+					so->skipData->curPos.nextDirection = prefixDir;
+				}
+
+				if (so->skipData->curPos.nextAction == SkipStateNext)
+				{
+					if (!_bt_step_back_page(scan, &itup, &offnum))
+						return false;
+				}
+				else if (so->skipData->curPos.nextAction == SkipStateStop)
+				{
+					BTScanPosUnpinIfPinned(so->currPos);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+
+				/* now find the next tuple */
+				match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+				if (!match)
+				{
+					if (_bt_skip_is_always_valid(so))
+						_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+					return false;
+				}
+				_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+			}
 		}
 	}
 
@@ -1501,8 +1318,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  *
  * Returns true if any matching items found on the page, false if none.
  */
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber *offnum, bool isRegularMode)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Page		page;
@@ -1512,6 +1329,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	int			itemIndex;
 	bool		continuescan;
 	int			indnatts;
+	int			prefixskipindex;
 
 	/*
 	 * We must have the buffer pinned and locked, but the usual macro can't be
@@ -1570,11 +1388,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		/* load items[] in ascending order */
 		itemIndex = 0;
 
-		offnum = Max(offnum, minoff);
+		*offnum = Max(*offnum, minoff);
 
-		while (offnum <= maxoff)
+		while (*offnum <= maxoff)
 		{
-			ItemId		iid = PageGetItemId(page, offnum);
+			ItemId		iid = PageGetItemId(page, *offnum);
 			IndexTuple	itup;
 
 			/*
@@ -1583,19 +1401,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			 */
 			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 			{
-				offnum = OffsetNumberNext(offnum);
+				*offnum = OffsetNumberNext(*offnum);
 				continue;
 			}
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+			if (_bt_checkkeys_extended(scan, itup, indnatts, dir, isRegularMode, &continuescan, &prefixskipindex))
 			{
 				/* tuple passes all scan key conditions */
 				if (!BTreeTupleIsPosting(itup))
 				{
 					/* Remember it */
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(so, itemIndex, *offnum, itup);
 					itemIndex++;
 				}
 				else
@@ -1607,26 +1425,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 					 * TID
 					 */
 					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+						_bt_setuppostingitems(so, itemIndex, *offnum,
 											  BTreeTupleGetPostingN(itup, 0),
 											  itup);
 					itemIndex++;
 					/* Remember additional TIDs */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(so, itemIndex, *offnum,
 											BTreeTupleGetPostingN(itup, i),
 											tupleOffset);
 						itemIndex++;
 					}
 				}
 			}
+
+			*offnum = OffsetNumberNext(*offnum);
+
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
 				break;
-
-			offnum = OffsetNumberNext(offnum);
+			if (!isRegularMode && prefixskipindex != -1)
+				break;
 		}
+		*offnum = OffsetNumberPrev(*offnum);
 
 		/*
 		 * We don't need to visit page to the right when the high key
@@ -1646,7 +1468,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			int			truncatt;
 
 			truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
-			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+			_bt_checkkeys(scan, itup, truncatt, dir, &continuescan, NULL);
 		}
 
 		if (!continuescan)
@@ -1662,11 +1484,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		/* load items[] in descending order */
 		itemIndex = MaxTIDsPerBTreePage;
 
-		offnum = Min(offnum, maxoff);
+		*offnum = Min(*offnum, maxoff);
 
-		while (offnum >= minoff)
+		while (*offnum >= minoff)
 		{
-			ItemId		iid = PageGetItemId(page, offnum);
+			ItemId		iid = PageGetItemId(page, *offnum);
 			IndexTuple	itup;
 			bool		tuple_alive;
 			bool		passes_quals;
@@ -1683,10 +1505,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			 */
 			if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
 			{
-				Assert(offnum >= P_FIRSTDATAKEY(opaque));
-				if (offnum > P_FIRSTDATAKEY(opaque))
+				Assert(*offnum >= P_FIRSTDATAKEY(opaque));
+				if (*offnum > P_FIRSTDATAKEY(opaque))
 				{
-					offnum = OffsetNumberPrev(offnum);
+					*offnum = OffsetNumberPrev(*offnum);
 					continue;
 				}
 
@@ -1697,8 +1519,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 			itup = (IndexTuple) PageGetItem(page, iid);
 
-			passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
-										 &continuescan);
+			passes_quals = _bt_checkkeys_extended(scan, itup, indnatts, dir,
+												  isRegularMode, &continuescan, &prefixskipindex);
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions */
@@ -1706,7 +1528,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 				{
 					/* Remember it */
 					itemIndex--;
-					_bt_saveitem(so, itemIndex, offnum, itup);
+					_bt_saveitem(so, itemIndex, *offnum, itup);
 				}
 				else
 				{
@@ -1724,28 +1546,32 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 					 */
 					itemIndex--;
 					tupleOffset =
-						_bt_setuppostingitems(so, itemIndex, offnum,
+						_bt_setuppostingitems(so, itemIndex, *offnum,
 											  BTreeTupleGetPostingN(itup, 0),
 											  itup);
 					/* Remember additional TIDs */
 					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
 					{
 						itemIndex--;
-						_bt_savepostingitem(so, itemIndex, offnum,
+						_bt_savepostingitem(so, itemIndex, *offnum,
 											BTreeTupleGetPostingN(itup, i),
 											tupleOffset);
 					}
 				}
 			}
+
+			*offnum = OffsetNumberPrev(*offnum);
+
 			if (!continuescan)
 			{
 				/* there can't be any more matches, so stop */
 				so->currPos.moreLeft = false;
 				break;
 			}
-
-			offnum = OffsetNumberPrev(offnum);
+			if (!isRegularMode && prefixskipindex != -1)
+				break;
 		}
+		*offnum = OffsetNumberNext(*offnum);
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
@@ -1853,7 +1679,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
  * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
  * to InvalidBuffer.  We return true to indicate success.
  */
-static bool
+bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1881,6 +1707,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		if (so->markTuples)
 			memcpy(so->markTuples, so->currTuples,
 				   so->currPos.nextTupleOffset);
+		if (so->skipData)
+			memcpy(&so->skipData->markPos, &so->skipData->curPos,
+				   sizeof(BTSkipPosData));
 		so->markPos.itemIndex = so->markItemIndex;
 		so->markItemIndex = -1;
 	}
@@ -1960,13 +1789,14 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
  * If there are no more matching records in the given direction, we drop all
  * locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
  */
-static bool
+bool
 _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
 	Relation	rel;
 	Page		page;
 	BTPageOpaque opaque;
+	OffsetNumber offnum;
 	bool		status = true;
 
 	rel = scan->indexRelation;
@@ -1998,7 +1828,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
 				/* see if there are any matches on this page */
 				/* note that this will clear moreRight if we can stop */
-				if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+				offnum = P_FIRSTDATAKEY(opaque);
+				if (_bt_readpage(scan, dir, &offnum, true))
 					break;
 			}
 			else if (scan->parallel_scan != NULL)
@@ -2100,7 +1931,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
 				PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
 				/* see if there are any matches on this page */
 				/* note that this will clear moreLeft if we can stop */
-				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+				offnum = PageGetMaxOffsetNumber(page);
+				if (_bt_readpage(scan, dir, &offnum, true))
 					break;
 			}
 			else if (scan->parallel_scan != NULL)
@@ -2168,7 +2000,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
  * to be half-dead; the caller should check that condition and step left
  * again if it's important.
  */
-static Buffer
+Buffer
 _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
 {
 	Page		page;
@@ -2432,7 +2264,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/*
 	 * Now load data from the first page of the scan.
 	 */
-	if (!_bt_readpage(scan, dir, start))
+	if (!_bt_readpage(scan, dir, &start, true))
 	{
 		/*
 		 * There's no actually-matching data on this page.  Try to advance to
@@ -2461,7 +2293,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
  * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
  * for scan direction
  */
-static inline void
+inline void
 _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
 {
 	/* initialize moreLeft/moreRight appropriately for scan direction */
@@ -2478,3 +2310,25 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
 	so->numKilled = 0;			/* just paranoia */
 	so->markItemIndex = -1;		/* ditto */
 }
+
+/* Forward the call to either _bt_checkkeys, the simplest and fastest way of
+ * checking keys, or to _bt_checkkeys_skip, which is slower but returns extra
+ * information about whether we should stop reading the current page and
+ * skip. The expensive check is only necessary when !isRegularMode, i.e.
+ * when prefixDir != postfixDir, which only happens when scanning backwards
+ * from a cursor.
+ */
+static inline bool
+_bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+					   ScanDirection dir, bool isRegularMode,
+					   bool *continuescan, int *prefixskipindex)
+{
+	if (isRegularMode)
+	{
+		return _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+	}
+	else
+	{
+		return _bt_checkkeys_skip(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+	}
+}
diff --git a/src/backend/access/nbtree/nbtskip.c b/src/backend/access/nbtree/nbtskip.c
new file mode 100644
index 0000000000..1be6bf62eb
--- /dev/null
+++ b/src/backend/access/nbtree/nbtskip.c
@@ -0,0 +1,1332 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtskip.c
+ *	  Search code related to skip scan for postgres btrees.
+ *
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/nbtree/nbtskip.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "catalog/catalog.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+#include "storage/predicate.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+
+static inline void _bt_update_scankey_with_tuple(BTScanInsert scankeys,
+											Relation indexRel, IndexTuple itup, int numattrs);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, Buffer buf);
+static inline int32 _bt_compare_until(Relation rel, BTScanInsert key, IndexTuple itup, int prefix);
+static inline void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum,
+						  OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+									 ScanDirection postfixDir, int skipped, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction);
+static inline void _bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs);
+static inline IndexTuple _bt_get_tuple_from_offset_with_copy(BTScanOpaque so, OffsetNumber curTupleOffnum);
+
+static void _bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+											   ScanDirection prefixDir, ScanDirection postfixDir);
+static void _bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+										int prefix, IndexTuple itup, ScanDirection prefixDir);
+static bool _bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir);
+
+/*
+ * Returns whether the scan is still valid, i.e. whether we should keep
+ * scanning. The scan position can be invalid even though we still should
+ * continue the scan. This happens, for example, when scanning with
+ * prefixDir != postfixDir: when looking at the first prefix, we traverse
+ * the items within the prefix from max to min. If none of them match, we
+ * actually run off the start of the index, meaning none of the tuples
+ * within this prefix match. The scan position becomes invalid; however,
+ * we do still need to look at the next prefix. Therefore, this function
+ * still returns true in that particular case.
+ */
+static inline bool
+_bt_skip_is_valid(BTScanOpaque so, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	return BTScanPosIsValid(so->currPos) ||
+			(!_bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+			 so->skipData->curPos.nextAction != SkipStateStop);
+}
+
+/* Try to find the next tuple to skip to within the local tuple storage.
+ * The local tuple storage is filled during _bt_readpage with all matching
+ * tuples on that page; if we can find the next prefix there, it saves us
+ * a scan from the root.
+ * Note that this optimization only works when _bt_skip_is_regular_mode
+ * returns true. If it does not, the local tuple workspace will only ever
+ * contain tuples of one specific prefix (_bt_readpage stops at the end of
+ * a prefix).
+ */
+static bool
+_bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTScanPosItem *currItem;
+	BTSkip skip = so->skipData;
+	IndexTuple itup = NULL;
+	bool goback;
+	int low, high, starthigh, startlow;
+	int32		result,
+				cmpval;
+	BTScanInsert key = &so->skipData->curPos.skipScanKey;
+
+	currItem = &so->currPos.items[so->currPos.itemIndex];
+	itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+	_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation, skip->prefix, itup, prefixDir);
+
+	_bt_set_bsearch_flags(key->scankeys[key->keysz - 1].sk_strategy, prefixDir, &key->nextkey, &goback);
+
+	/* Requesting nextkey semantics while using scantid seems nonsensical */
+	Assert(!key->nextkey || key->scantid == NULL);
+	/* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+
+	startlow = low = ScanDirectionIsForward(prefixDir) ? so->currPos.itemIndex + 1 : so->currPos.firstItem;
+	starthigh = high = ScanDirectionIsForward(prefixDir) ? so->currPos.lastItem : so->currPos.itemIndex - 1;
+
+	/*
+	 * If there are no items in the search range, there is nothing in the
+	 * local tuple storage to skip to, so give up.
+	 */
+	if (unlikely(high < low))
+		return false;
+
+	/*
+	 * Binary search to find the first key on the page >= scan key, or first
+	 * key > scankey when nextkey is true.
+	 *
+	 * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+	 * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+	 *
+	 * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+	 * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+	 *
+	 * We can fall out when high == low.
+	 */
+	high++;						/* establish the loop invariant for high */
+
+	cmpval = key->nextkey ? 0 : 1;	/* select comparison value */
+
+	while (high > low)
+	{
+		int mid = low + ((high - low) / 2);
+
+		/* We have low <= mid < high, so mid points at a real slot */
+
+		currItem = &so->currPos.items[mid];
+		itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+		result = _bt_compare_until(scan->indexRelation, key, itup, skip->prefix);
+
+		if (result >= cmpval)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	if (high > starthigh)
+		return false;
+
+	if (goback)
+	{
+		low--;
+		if (low < startlow)
+			return false;
+	}
+
+	so->currPos.itemIndex = low;
+
+	return true;
+}
+
+/*
+ *  _bt_skip() -- Skip items that have the same prefix as the most recently
+ * 				  fetched index tuple.
+ *
+ * in: pinned, not locked
+ * out: pinned, not locked (unless end of scan, then unpinned)
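+ *
+ * Illustrative example: with an index on (a, b) and prefix = 1, if the
+ * most recently fetched tuple was (2, 5), a forward skip positions the
+ * scan on the first tuple with a > 2.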
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTScanPosItem *currItem;
+	IndexTuple itup = NULL;
+	OffsetNumber curTupleOffnum = InvalidOffsetNumber;
+	BTSkipCompareResult cmp;
+	BTSkip skip = so->skipData;
+	OffsetNumber first;
+
+	/* in page skip only works when prefixDir == postfixDir */
+	if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) || !_bt_try_in_page_skip(scan, prefixDir))
+	{
+		currItem = &so->currPos.items[so->currPos.itemIndex];
+		itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+		so->skipData->curPos.nextSkipIndex = so->skipData->prefix;
+		_bt_skip_once(scan, &itup, &curTupleOffnum, true, prefixDir, postfixDir);
+		_bt_skip_until_match(scan, &itup, &curTupleOffnum, prefixDir, postfixDir);
+		if (!_bt_skip_is_always_valid(so))
+			return false;
+
+		first = curTupleOffnum;
+		_bt_readpage(scan, postfixDir, &curTupleOffnum, _bt_skip_is_regular_mode(prefixDir, postfixDir));
+		if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+		{
+			print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+						"first item on page compared after skip");
+			print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+						"last item on page compared after skip");
+		}
+		_bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+								 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+								 postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+		_bt_determine_next_action(scan, &cmp, first, curTupleOffnum, postfixDir, &skip->curPos.nextAction);
+		skip->curPos.nextDirection = prefixDir;
+		skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+		_bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum), prefixDir, postfixDir);
+
+		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+	}
+
+	/* Prepare for the call to _bt_next, which advances this index again to get to the tuple we want to be at */
+	if (ScanDirectionIsForward(postfixDir))
+		so->currPos.itemIndex--;
+	else
+		so->currPos.itemIndex++;
+
+	return true;
+}
+
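+/*
+ * Return a pointer to the index tuple at the given offset on the page
+ * pinned in so->currPos.buf. The tuple is not copied, so the pointer is
+ * only valid while the page remains pinned.
+ */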
+IndexTuple
+_bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+	Page page = BufferGetPage(so->currPos.buf);
+	return (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+}
+
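+/*
+ * Like _bt_get_tuple_from_offset, but copies the tuple into the skipTuple
+ * workspace so that it remains valid after the page lock and pin are
+ * released.
+ */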
+static IndexTuple
+_bt_get_tuple_from_offset_with_copy(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+	Page page = BufferGetPage(so->currPos.buf);
+	IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+	Size		itupsz = IndexTupleSize(itup);
+	memcpy(so->skipData->curPos.skipTuple, itup, itupsz);
+
+	return (IndexTuple) so->skipData->curPos.skipTuple;
+}
+
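+/*
+ * Decide what the scan should do after reading a page: stop the scan
+ * entirely (SkipStateStop), step to the next page (SkipStateNext), skip
+ * to the next prefix (SkipStateSkip), or skip within the current prefix
+ * using the extra quals (SkipStateSkipExtra). The decision is based on
+ * the comparison result for the last tuple read and on whether extra
+ * quals are available after the prefix.
+ */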
+static void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum, OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	if (cmp->fullKeySkip)
+		*nextAction = SkipStateStop;
+	else if (ScanDirectionIsForward(postfixDir))
+	{
+		OffsetNumber firstItem = firstOffnum, lastItem = lastOffnum;
+		if (cmp->prefixSkip)
+		{
+			*nextAction = SkipStateSkip;
+		}
+		else
+		{
+			IndexTuple toCmp;
+			if (so->currPos.lastItem >= so->currPos.firstItem)
+				toCmp = _bt_get_tuple_from_offset_with_copy(so, so->currPos.items[so->currPos.lastItem].indexOffset);
+			else
+				toCmp = _bt_get_tuple_from_offset_with_copy(so, firstItem);
+			_bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+										  scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+			if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+					(cmp->prefixCmpResult != 0 ||
+					 _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+									   _bt_get_tuple_from_offset(so, lastItem), so->skipData->prefix) != 0))
+				*nextAction = SkipStateSkipExtra;
+			else
+				*nextAction = SkipStateNext;
+		}
+	}
+	else
+	{
+		OffsetNumber firstItem = lastOffnum, lastItem = firstOffnum;
+		if (cmp->prefixSkip)
+		{
+			*nextAction = SkipStateSkip;
+		}
+		else
+		{
+			IndexTuple toCmp;
+			if (so->currPos.lastItem >= so->currPos.firstItem)
+				toCmp = _bt_get_tuple_from_offset_with_copy(so, so->currPos.items[so->currPos.firstItem].indexOffset);
+			else
+				toCmp = _bt_get_tuple_from_offset_with_copy(so, lastItem);
+			_bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+										  scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+			if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+					(cmp->prefixCmpResult != 0 ||
+					 _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+									   _bt_get_tuple_from_offset(so, firstItem), so->skipData->prefix) != 0))
+				*nextAction = SkipStateSkipExtra;
+			else
+				*nextAction = SkipStateNext;
+		}
+	}
+}
+
+static inline bool
+_bt_should_prefix_skip(BTSkipCompareResult *cmp)
+{
+	return cmp->prefixSkip || cmp->prefixCmpResult != 0;
+}
+
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+									 ScanDirection postfixDir, int skipped, BTSkipState *nextAction)
+{
+	if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+		*nextAction = SkipStateStop;
+	else if (cmp->equal && _bt_skip_is_regular_mode(prefixDir, postfixDir))
+		*nextAction = SkipStateNext;
+	else if (_bt_should_prefix_skip(cmp) && _bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+			 ((ScanDirectionIsForward(prefixDir) && cmp->skCmpResult == -1) ||
+			  (ScanDirectionIsBackward(prefixDir) && cmp->skCmpResult == 1)))
+		*nextAction = SkipStateSkip;
+	else if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) ||
+			 _bt_has_extra_quals_after_skip(so->skipData, postfixDir, skipped) ||
+			 cmp->prefixCmpResult != 0)
+		*nextAction = SkipStateSkipExtra;
+	else
+		*nextAction = SkipStateNext;
+}
+
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction)
+{
+	if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+		*nextAction = SkipStateStop;
+	else if (cmp->equal)
+		*nextAction = SkipStateNext;
+	else if (_bt_should_prefix_skip(cmp))
+		*nextAction = SkipStateSkip;
+	else
+		*nextAction = SkipStateNext;
+}
+
+/* Just a debug function that prints a scankey; will be removed for the final patch */
+static inline void
+_print_skey(IndexScanDesc scan, BTScanInsert scanKey)
+{
+	Oid			typOutput;
+	bool		varlenatype;
+	char	   *val;
+	int i;
+	Relation rel = scan->indexRelation;
+
+	for (i = 0; i < scanKey->keysz; i++)
+	{
+		ScanKey cur = &scanKey->scankeys[i];
+		if (!IsCatalogRelation(rel))
+		{
+			if (!(cur->sk_flags & SK_ISNULL))
+			{
+				if (cur->sk_subtype != InvalidOid)
+					getTypeOutputInfo(cur->sk_subtype,
+									  &typOutput, &varlenatype);
+				else
+					getTypeOutputInfo(rel->rd_opcintype[i],
+									  &typOutput, &varlenatype);
+				val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+				if (val)
+				{
+					elog(DEBUG1, "%s sk attr %d val: %s (%s, %s)",
+						 RelationGetRelationName(rel), i, val,
+						 (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+						 (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+					pfree(val);
+				}
+			}
+			else
+			{
+				elog(DEBUG1, "%s sk attr %d val: NULL (%s, %s)",
+					 RelationGetRelationName(rel), i,
+					 (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+					 (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+			}
+		}
+	}
+}
+
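+/*
+ * Variant of _bt_checkkeys that is used when prefixDir != postfixDir.
+ * Besides checking whether the tuple matches the scan keys, it compares
+ * the tuple's prefix against the current skip scan key and sets
+ * *prefixskipindex when the tuple has left the current prefix, so the
+ * caller knows to stop reading the page and skip.
+ */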
+bool
+_bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+				   ScanDirection dir, bool *continuescan, int *prefixskipindex)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+
+	bool match = _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+	int prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+	if (*prefixskipindex == -1 && prefixCmpResult != 0)
+	{
+		*prefixskipindex = skip->prefix;
+		return false;
+	}
+	else
+	{
+		bool newcont;
+		_bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &newcont, prefixskipindex);
+		if (*prefixskipindex == -1 && prefixCmpResult != 0)
+		{
+			*prefixskipindex = skip->prefix;
+			return false;
+		}
+	}
+	return match;
+}
+
+/*
+ * Compare a scankey with a given tuple, but only on the first 'prefix'
+ * columns. This function returns 0 if the first 'prefix' columns are equal,
+ * -1 if key < itup on the first 'prefix' columns,
+ * and 1 if key > itup on the first 'prefix' columns.
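+ *
+ * Illustrative example: with prefix = 2, a scankey on (a, b) with values
+ * (2, 5) compares equal (returns 0) to the index tuple (2, 5, 99), since
+ * only the first two columns are examined.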
+ */
+int32
+_bt_compare_until(Relation rel,
+			BTScanInsert key,
+			IndexTuple itup,
+			int prefix)
+{
+	TupleDesc	itupdesc = RelationGetDescr(rel);
+	ScanKey		scankey;
+	int			ncmpkey;
+
+	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+
+	ncmpkey = Min(prefix, key->keysz);
+	scankey = key->scankeys;
+	for (int i = 1; i <= ncmpkey; i++)
+	{
+		Datum		datum;
+		bool		isNull;
+		int32		result;
+
+		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+		/* see comments about NULLs handling in btbuild */
+		if (scankey->sk_flags & SK_ISNULL)	/* key is NULL */
+		{
+			if (isNull)
+				result = 0;		/* NULL "=" NULL */
+			else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+				result = -1;	/* NULL "<" NOT_NULL */
+			else
+				result = 1;		/* NULL ">" NOT_NULL */
+		}
+		else if (isNull)		/* key is NOT_NULL and item is NULL */
+		{
+			if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+				result = 1;		/* NOT_NULL ">" NULL */
+			else
+				result = -1;	/* NOT_NULL "<" NULL */
+		}
+		else
+		{
+			/*
+			 * The sk_func needs to be passed the index value as left arg and
+			 * the sk_argument as right arg (they might be of different
+			 * types).  Since it is convenient for callers to think of
+			 * _bt_compare as comparing the scankey to the index item, we have
+			 * to flip the sign of the comparison result.  (Unless it's a DESC
+			 * column, in which case we *don't* flip the sign.)
+			 */
+			result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+													 scankey->sk_collation,
+													 datum,
+													 scankey->sk_argument));
+
+			if (!(scankey->sk_flags & SK_BT_DESC))
+				INVERT_COMPARE_RESULT(result);
+		}
+
+		/* if the keys are unequal, return the difference */
+		if (result != 0)
+			return result;
+
+		scankey++;
+	}
+	return 0;
+}
+
+
+/*
+ * Create initial scankeys for skipping and stores them in the skipData
+ * structure
+ */
+void
+_bt_skip_create_scankeys(Relation rel, BTScanOpaque so)
+{
+	int keysCount;
+	BTSkip skip = so->skipData;
+	StrategyNumber stratTotal;
+	ScanKey		keyPointers[INDEX_MAX_KEYS];
+	bool goback;
+	/* We need to create both forward and backward keys because the scan
+	 * direction may change at any moment in scans with a cursor. We could
+	 * technically delay creating the second set until first use as an
+	 * optimization, but that is not implemented yet.
+	 */
+	keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, ForwardScanDirection,
+									 keyPointers, skip->fwdNotNullKeys, &stratTotal, skip->prefix);
+	_bt_create_insertion_scan_key(rel, ForwardScanDirection, keyPointers, keysCount,
+								  &skip->fwdScanKey, &stratTotal, &goback);
+
+	keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, BackwardScanDirection,
+									 keyPointers, skip->bwdNotNullKeys, &stratTotal, skip->prefix);
+	_bt_create_insertion_scan_key(rel, BackwardScanDirection, keyPointers, keysCount,
+								  &skip->bwdScanKey, &stratTotal, &goback);
+
+	_bt_metaversion(rel, &skip->curPos.skipScanKey.heapkeyspace,
+					&skip->curPos.skipScanKey.allequalimage);
+	skip->curPos.skipScanKey.anynullkeys = false; /* unused */
+	skip->curPos.skipScanKey.nextkey = false;
+	skip->curPos.skipScanKey.pivotsearch = false;
+	skip->curPos.skipScanKey.scantid = NULL;
+	skip->curPos.skipScanKey.keysz = 0;
+
+	/* Set up the scankey for the current tuple as well. We won't necessarily
+	 * use the data from the current tuple right away, but we need the rest
+	 * of the data structure to be set up correctly for when we use it to
+	 * create skip->curPos.skipScanKey keys later.
+	 */
+	_bt_mkscankey(rel, NULL, &skip->currentTupleKey);
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * 								within a page, specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+						Buffer buf)
+{
+	/* @todo: optimization is still possible here to
+	 * only check either the low or the high, depending on
+	 * which direction *we came from* AND which direction
+	 * *we are planning to scan*
+	 */
+	OffsetNumber low, high;
+	Page page = BufferGetPage(buf);
+	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	int			ans_lo, ans_hi;
+
+	low = P_FIRSTDATAKEY(opaque);
+	high = PageGetMaxOffsetNumber(page);
+
+	if (unlikely(high < low))
+		return false;
+
+	ans_lo = _bt_compare(scan->indexRelation,
+					   key, page, low);
+	ans_hi = _bt_compare(scan->indexRelation,
+					   key, page, high);
+	if (key->nextkey)
+	{
+		/* sk < last && sk >= first */
+		return ans_lo >= 0 && ans_hi == -1;
+	}
+	else
+	{
+		/* sk <= last && sk > first */
+		return ans_lo == 1 && ans_hi <= 0;
+	}
+}
+
+/* in: pinned and locked, out: pinned and locked (unless end of scan) */
+static void
+_bt_skip_find(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+			  BTScanInsert scanKey, ScanDirection dir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	OffsetNumber offnum;
+	BTStack stack;
+	Buffer buf;
+	bool goback;
+	Page		page;
+	BTPageOpaque opaque;
+	OffsetNumber minoff;
+	Relation rel = scan->indexRelation;
+	bool fromroot = true;
+
+	_bt_set_bsearch_flags(scanKey->scankeys[scanKey->keysz - 1].sk_strategy, dir, &scanKey->nextkey, &goback);
+
+	if ((DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages) && !IsCatalogRelation(rel))
+	{
+		if (*curTuple != NULL)
+			print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+						"before btree search");
+
+		elog(DEBUG1, "%s searching tree with %d keys, nextkey=%d, goback=%d",
+			 RelationGetRelationName(rel), scanKey->keysz, scanKey->nextkey,
+			 goback);
+
+		_print_skey(scan, scanKey);
+	}
+
+	if (*curTupleOffnum == InvalidOffsetNumber)
+	{
+		BTScanPosUnpinIfPinned(so->currPos);
+	}
+	else
+	{
+		if (_bt_scankey_within_page(scan, scanKey, so->currPos.buf))
+		{
+			elog(DEBUG1, "sk found within current page");
+
+			offnum = _bt_binsrch(scan->indexRelation, scanKey, so->currPos.buf);
+			fromroot = false;
+		}
+		else
+		{
+			_bt_unlockbuf(rel, so->currPos.buf);
+			ReleaseBuffer(so->currPos.buf);
+			so->currPos.buf = InvalidBuffer;
+		}
+	}
+
+	/*
+	 * We haven't found scan key within the current page, so let's scan from
+	 * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+	 * number
+	 */
+	if (fromroot)
+	{
+		stack = _bt_search(scan->indexRelation, scanKey,
+						   &buf, BT_READ, scan->xs_snapshot);
+		_bt_freestack(stack);
+		so->currPos.buf = buf;
+
+		offnum = _bt_binsrch(scan->indexRelation, scanKey, buf);
+
+		/* Lock the page for SERIALIZABLE transactions */
+		PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+						  scan->xs_snapshot);
+	}
+
+	page = BufferGetPage(so->currPos.buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	if (goback)
+	{
+		offnum = OffsetNumberPrev(offnum);
+		minoff = P_FIRSTDATAKEY(opaque);
+		if (offnum < minoff)
+		{
+			_bt_unlockbuf(rel, so->currPos.buf);
+			if (!_bt_step_back_page(scan, curTuple, curTupleOffnum))
+				return;
+			page = BufferGetPage(so->currPos.buf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			offnum = PageGetMaxOffsetNumber(page);
+		}
+	}
+	else if (offnum > PageGetMaxOffsetNumber(page))
+	{
+		BlockNumber next = opaque->btpo_next;
+		_bt_unlockbuf(rel, so->currPos.buf);
+		if (!_bt_step_forward_page(scan, next, curTuple, curTupleOffnum))
+			return;
+		page = BufferGetPage(so->currPos.buf);
+		opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		offnum = P_FIRSTDATAKEY(opaque);
+	}
+
+	/* We know in which direction to look */
+	_bt_initialize_more_data(so, dir);
+
+	*curTupleOffnum = offnum;
+	*curTuple = _bt_get_tuple_from_offset(so, offnum);
+	so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+	if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+		print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+					"after btree search");
+}
+
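+/*
+ * Step one page in the given direction, delegating to
+ * _bt_step_forward_page or _bt_step_back_page.
+ */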
+static inline bool
+_bt_step_one_page(IndexScanDesc scan, ScanDirection dir, IndexTuple *curTuple,
+				  OffsetNumber *curTupleOffnum)
+{
+	if (ScanDirectionIsForward(dir))
+	{
+		BTScanOpaque so = (BTScanOpaque) scan->opaque;
+		return _bt_step_forward_page(scan, so->currPos.nextPage, curTuple, curTupleOffnum);
+	}
+	else
+	{
+		return _bt_step_back_page(scan, curTuple, curTupleOffnum);
+	}
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+					  OffsetNumber *curTupleOffnum)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation rel = scan->indexRelation;
+	BlockNumber blkno = next;
+	Page page;
+	BTPageOpaque opaque;
+
+	Assert(BTScanPosIsValid(so->currPos));
+
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		_bt_killitems(scan);
+
+	/*
+	 * Before we modify currPos, make a copy of the page data if there was a
+	 * mark position that needs it.
+	 */
+	if (so->markItemIndex >= 0)
+	{
+		/* bump pin on current buffer for assignment to mark buffer */
+		if (BTScanPosIsPinned(so->currPos))
+			IncrBufferRefCount(so->currPos.buf);
+		memcpy(&so->markPos, &so->currPos,
+			   offsetof(BTScanPosData, items[1]) +
+			   so->currPos.lastItem * sizeof(BTScanPosItem));
+		if (so->markTuples)
+			memcpy(so->markTuples, so->currTuples,
+				   so->currPos.nextTupleOffset);
+		so->markPos.itemIndex = so->markItemIndex;
+		if (so->skipData)
+			memcpy(&so->skipData->markPos, &so->skipData->curPos,
+				   sizeof(BTSkipPosData));
+		so->markItemIndex = -1;
+	}
+
+	/* Remember we left a page with data */
+	so->currPos.moreLeft = true;
+
+	/* release the previous buffer, if pinned */
+	BTScanPosUnpinIfPinned(so->currPos);
+
+	{
+		for (;;)
+		{
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
+			if (blkno == P_NONE)
+			{
+				_bt_parallel_done(scan);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+
+			/* check for interrupts while we're not holding any buffer lock */
+			CHECK_FOR_INTERRUPTS();
+			/* step right one page */
+			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			page = BufferGetPage(so->currPos.buf);
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
+			if (!P_IGNORE(opaque))
+			{
+				PredicateLockPage(rel, blkno, scan->xs_snapshot);
+				*curTupleOffnum = P_FIRSTDATAKEY(opaque);
+				*curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+				break;
+			}
+
+			blkno = opaque->btpo_next;
+			_bt_relbuf(rel, so->currPos.buf);
+		}
+	}
+
+	return true;
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	Assert(BTScanPosIsValid(so->currPos));
+
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		_bt_killitems(scan);
+
+	/*
+	 * Before we modify currPos, make a copy of the page data if there was a
+	 * mark position that needs it.
+	 */
+	if (so->markItemIndex >= 0)
+	{
+		/* bump pin on current buffer for assignment to mark buffer */
+		if (BTScanPosIsPinned(so->currPos))
+			IncrBufferRefCount(so->currPos.buf);
+		memcpy(&so->markPos, &so->currPos,
+			   offsetof(BTScanPosData, items[1]) +
+			   so->currPos.lastItem * sizeof(BTScanPosItem));
+		if (so->markTuples)
+			memcpy(so->markTuples, so->currTuples,
+				   so->currPos.nextTupleOffset);
+		if (so->skipData)
+			memcpy(&so->skipData->markPos, &so->skipData->curPos,
+				   sizeof(BTSkipPosData));
+		so->markPos.itemIndex = so->markItemIndex;
+		so->markItemIndex = -1;
+	}
+
+	/* Remember we left a page with data */
+	so->currPos.moreRight = true;
+
+	/* Not parallel, so just use our own notion of the current page */
+
+	{
+		Relation	rel;
+		Page		page;
+		BTPageOpaque opaque;
+
+		rel = scan->indexRelation;
+
+		if (BTScanPosIsPinned(so->currPos))
+			_bt_lockbuf(rel, so->currPos.buf, BT_READ);
+		else
+			so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+
+		for (;;)
+		{
+			/* Step to next physical page */
+			so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+											scan->xs_snapshot);
+
+			/* if we're physically at end of index, return failure */
+			if (so->currPos.buf == InvalidBuffer)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+
+			/*
+			 * Okay, we managed to move left to a non-deleted page. Done if
+			 * it's not half-dead and contains matching tuples. Else loop back
+			 * and do it all again.
+			 */
+			page = BufferGetPage(so->currPos.buf);
+			TestForOldSnapshot(scan->xs_snapshot, rel, page);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			if (!P_IGNORE(opaque))
+			{
+				PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+				*curTupleOffnum = PageGetMaxOffsetNumber(page);
+				*curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+				break;
+			}
+		}
+	}
+
+	return true;
+}
+
+/* holds lock as long as curTupleOffnum != InvalidOffsetNumber */
+bool
+_bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+				   ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	BTSkipCompareResult cmp;
+
+	while (_bt_skip_is_valid(so, prefixDir, postfixDir))
+	{
+		bool found;
+		_bt_skip_until_match(scan, &curTuple, &curTupleOffnum, prefixDir, postfixDir);
+
+		while (_bt_skip_is_always_valid(so))
+		{
+			OffsetNumber first = curTupleOffnum;
+			found = _bt_readpage(scan, postfixDir, &curTupleOffnum,
+								 _bt_skip_is_regular_mode(prefixDir, postfixDir));
+			if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+			{
+				print_itup(BufferGetBlockNumber(so->currPos.buf),
+						   _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+							"first item on page compared");
+				print_itup(BufferGetBlockNumber(so->currPos.buf),
+						   _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+							"last item on page compared");
+			}
+			_bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+									 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+									 postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+			_bt_determine_next_action(scan, &cmp, first, curTupleOffnum,
+									  postfixDir, &skip->curPos.nextAction);
+			skip->curPos.nextDirection = prefixDir;
+			skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+
+			if (found)
+			{
+				_bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+												   prefixDir, postfixDir);
+				return true;
+			}
+			else if (skip->curPos.nextAction == SkipStateNext)
+			{
+				if (curTupleOffnum != InvalidOffsetNumber)
+					_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+				if (!_bt_step_one_page(scan, postfixDir, &curTuple, &curTupleOffnum))
+					return false;
+			}
+			else if (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra)
+			{
+				curTuple = _bt_get_tuple_from_offset(so, curTupleOffnum);
+				_bt_skip_update_scankey_after_read(scan, curTuple, prefixDir, postfixDir);
+				_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+				curTupleOffnum = InvalidOffsetNumber;
+				curTuple = NULL;
+				break;
+			}
+			else if (skip->curPos.nextAction == SkipStateStop)
+			{
+				_bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+			else
+			{
+				Assert(false);
+			}
+		}
+	}
+	return false;
+}
+
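+/*
+ * Keep skipping prefixes until the next action is no longer a skip, i.e.
+ * until we have found a potentially matching position or reached the end
+ * of the scan.
+ */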
+void
+_bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+					 ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	while (_bt_skip_is_valid(so, prefixDir, postfixDir) &&
+		   (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra))
+	{
+		_bt_skip_once(scan, curTuple, curTupleOffnum,
+					  skip->curPos.nextAction == SkipStateSkip, prefixDir, postfixDir);
+	}
+}
+
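+/*
+ * Compare the given tuple against the scan keys and the current skip scan
+ * key, filling in the BTSkipCompareResult that drives the decision about
+ * what to do next (continue, skip to the next prefix, or stop).
+ */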
+void
+_bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts, ScanDirection dir,
+						 bool isRegularMode, BTSkipCompareResult* cmp)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+
+	if (_bt_skip_is_always_valid(so))
+	{
+		bool continuescan = true;
+
+		cmp->equal = _bt_checkkeys(scan, tuple, tupnatts, dir, &continuescan, &cmp->prefixSkipIndex);
+		cmp->fullKeySkip = !continuescan;
+		/* The prefix can be smaller than the scankey due to extra quals being
+		 * added, therefore we need to compare both.
+		 * @todo this can be optimized into one function call */
+		cmp->prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+		cmp->skCmpResult = _bt_compare_until(scan->indexRelation,
+											 &skip->curPos.skipScanKey, tuple, skip->curPos.skipScanKey.keysz);
+		if (cmp->prefixSkipIndex == -1)
+		{
+			cmp->prefixSkipIndex = skip->prefix;
+			cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+		}
+		else
+		{
+			int newskip = -1;
+			_bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &continuescan, &newskip);
+			if (newskip != -1)
+			{
+				cmp->prefixSkip = true;
+				cmp->prefixSkipIndex = newskip;
+			}
+			else
+			{
+				cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+				cmp->prefixSkipIndex = skip->prefix;
+			}
+		}
+
+		if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+		{
+			print_itup(BufferGetBlockNumber(so->currPos.buf), tuple, NULL, scan->indexRelation,
+						"compare item");
+			_print_skey(scan, &skip->curPos.skipScanKey);
+			elog(DEBUG1, "result: eq: %d fkskip: %d pfxskip: %d prefixcmpres: %d prefixskipidx: %d", cmp->equal, cmp->fullKeySkip,
+				 _bt_should_prefix_skip(cmp), cmp->prefixCmpResult, cmp->prefixSkipIndex);
+		}
+	}
+	else
+	{
+		/* We cannot stop the scan if !isRegularMode; in that case we do need to skip to the next prefix */
+		cmp->fullKeySkip = isRegularMode;
+		cmp->equal = false;
+		cmp->prefixCmpResult = -2;
+		cmp->prefixSkip = true;
+		cmp->prefixSkipIndex = skip->prefix;
+		cmp->skCmpResult = -2;
+	}
+}
+
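+/*
+ * Perform one skip: update the skip scan key from the current tuple,
+ * search the tree for the next prefix, and determine the next action.
+ * This can loop when the tuple found after the skip immediately requires
+ * another prefix skip.
+ */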
+void
+_bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+			  bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	BTSkipCompareResult cmp;
+	bool doskip = forceSkip;
+	int skipIndex = skip->curPos.nextSkipIndex;
+	skip->curPos.nextAction = SkipStateSkipExtra;
+
+	while (doskip)
+	{
+		int toskip = skipIndex;
+		if (*curTuple != NULL)
+		{
+			if (skip->prefix <= skipIndex || !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+			{
+				toskip = skip->prefix;
+			}
+
+			_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+													toskip, *curTuple, prefixDir);
+		}
+
+		_bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, prefixDir);
+
+		if (_bt_skip_is_always_valid(so))
+		{
+			_bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+												   prefixDir, prefixDir, true, *curTuple);
+			_bt_compare_current_item(scan, *curTuple,
+									 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+									 prefixDir,
+									 _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+			skipIndex = cmp.prefixSkipIndex;
+			_bt_determine_next_action_after_skip(so, &cmp, prefixDir,
+												 postfixDir, toskip, &skip->curPos.nextAction);
+		}
+		else
+		{
+			skip->curPos.nextAction = SkipStateStop;
+		}
+		doskip = skip->curPos.nextAction == SkipStateSkip;
+	}
+	if (skip->curPos.nextAction != SkipStateStop && skip->curPos.nextAction != SkipStateNext)
+		_bt_skip_extra_conditions(scan, curTuple, curTupleOffnum, prefixDir, postfixDir, &cmp);
+}
+
+void
+_bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+						  ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	bool regularMode = _bt_skip_is_regular_mode(prefixDir, postfixDir);
+	if (_bt_skip_is_always_valid(so))
+	{
+		do
+		{
+			if (*curTuple != NULL)
+				_bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+													   postfixDir, prefixDir, false, *curTuple);
+			_bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, postfixDir);
+			_bt_compare_current_item(scan, *curTuple,
+									 IndexRelationGetNumberOfAttributes(scan->indexRelation),
+									 postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), cmp);
+		} while (regularMode && cmp->prefixCmpResult != 0 && !cmp->equal && !cmp->fullKeySkip);
+		skip->curPos.nextSkipIndex = cmp->prefixSkipIndex;
+	}
+	_bt_determine_next_action_after_skip_extra(so, cmp, &skip->curPos.nextAction);
+}
+
+static void
+_bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+								   ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	if (skip->curPos.nextAction == SkipStateSkip)
+	{
+		int toskip = skip->curPos.nextSkipIndex;
+		if (skip->prefix <= skip->curPos.nextSkipIndex ||
+				!_bt_skip_is_regular_mode(prefixDir, postfixDir))
+		{
+			toskip = skip->prefix;
+		}
+
+		if (_bt_skip_is_regular_mode(prefixDir, postfixDir))
+			_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+													toskip, curTuple, prefixDir);
+		else
+			_bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+													toskip, NULL, prefixDir);
+	}
+	else if (skip->curPos.nextAction == SkipStateSkipExtra)
+	{
+		_bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+											   postfixDir, prefixDir, false, curTuple);
+	}
+}
+
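+/*
+ * Compare a single scankey entry against one attribute value, honoring
+ * NULL ordering and DESC flags. Returns -1, 0 or 1, with the same sign
+ * convention as _bt_compare_until.
+ */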
+static inline int
+_bt_compare_one(ScanKey scankey, Datum datum2, bool isNull2)
+{
+	int32		result;
+	Datum datum1 = scankey->sk_argument;
+	bool isNull1 = scankey->sk_flags & SK_ISNULL;
+	/* see comments about NULLs handling in btbuild */
+	if (isNull1)	/* key is NULL */
+	{
+		if (isNull2)
+			result = 0;		/* NULL "=" NULL */
+		else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+			result = -1;	/* NULL "<" NOT_NULL */
+		else
+			result = 1;		/* NULL ">" NOT_NULL */
+	}
+	else if (isNull2)		/* key is NOT_NULL and item is NULL */
+	{
+		if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+			result = 1;		/* NOT_NULL ">" NULL */
+		else
+			result = -1;	/* NOT_NULL "<" NULL */
+	}
+	else
+	{
+		/*
+		 * The sk_func needs to be passed the index value as left arg and
+		 * the sk_argument as right arg (they might be of different
+		 * types).  Since it is convenient for callers to think of
+		 * _bt_compare as comparing the scankey to the index item, we have
+		 * to flip the sign of the comparison result.  (Unless it's a DESC
+		 * column, in which case we *don't* flip the sign.)
+		 */
+		result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+												 scankey->sk_collation,
+												 datum2,
+												 datum1));
+
+		if (!(scankey->sk_flags & SK_BT_DESC))
+			INVERT_COMPARE_RESULT(result);
+	}
+	return result;
+}
+
+/*
+ * set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_scankey_with_tuple(BTScanInsert insertKey, Relation indexRel, IndexTuple itup, int numattrs)
+{
+	TupleDesc		itupdesc;
+	int				i;
+	ScanKey			scankeys = insertKey->scankeys;
+
+	insertKey->keysz = numattrs;
+	itupdesc = RelationGetDescr(indexRel);
+	for (i = 0; i < numattrs; i++)
+	{
+		Datum datum;
+		bool null;
+		int flags;
+
+		datum = index_getattr(itup, i + 1, itupdesc, &null);
+		flags = (null ? SK_ISNULL : 0) |
+				(indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+		scankeys[i].sk_flags = flags;
+		scankeys[i].sk_argument = datum;
+	}
+}
+
+/* copy the elements important to a skip from one insertion sk to another */
+static inline void
+_bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs)
+{
+	memcpy(to->scankeys, from->scankeys, sizeof(ScanKeyData) * (unsigned long)numattrs);
+	to->nextkey = from->nextkey;
+	to->keysz = numattrs;
+}
+
+/*
+ * Updates the existing scankey for skipping to the next prefix.
+ * The 'prefix' argument determines how many attributes the scankey will
+ * have: callers pass either the full skip->prefix, or a smaller value
+ * derived from the comparison result with the current tuple.
+ * For example, take SELECT * FROM tbl WHERE b<2 with an index on (a,b,c),
+ * skipping with prefix size=2. If we encounter the tuple (1,3,1), it does
+ * not match the qual b<2, but we also know that skipping to any next prefix
+ * of size 2 (e.g. (1,4)) is useless, because that will definitely not match
+ * either. We do, however, want to skip to e.g. (2,0). Therefore we skip
+ * over prefix=1 in this case.
+ *
+ * The provided itup may be NULL. This happens when we don't want to use the
+ * current tuple to update the scankey, but instead want to use the existing
+ * curPos.skipScanKey to fill currentTupleKey, which accounts for some edge
+ * cases.
+ */
+static void
+_bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+										int prefix, IndexTuple itup, ScanDirection prefixDir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	/*
+	 * The caller passes skip->prefix, or a smaller value derived from the
+	 * comparison result, so that we never skip over more than skip->prefix
+	 * attributes.
+	 */
+	int numattrs = prefix;
+
+	if (itup != NULL)
+	{
+		Size		itupsz = IndexTupleSize(itup);
+		memcpy(so->skipData->curPos.skipTuple, itup, itupsz);
+
+		_bt_update_scankey_with_tuple(&skip->currentTupleKey, indexRel, (IndexTuple)so->skipData->curPos.skipTuple, numattrs);
+		_bt_copy_scankey(&skip->curPos.skipScanKey, &skip->currentTupleKey, numattrs);
+	}
+	else
+	{
+		skip->curPos.skipScanKey.keysz = numattrs;
+		_bt_copy_scankey(&skip->currentTupleKey, &skip->curPos.skipScanKey, numattrs);
+	}
+	/*
+	 * Update the strategy for the last attribute, as we will use it to
+	 * determine the remaining flags (goback) when doing the actual tree
+	 * search.
+	 */
+	skip->currentTupleKey.scankeys[numattrs - 1].sk_strategy =
+			skip->curPos.skipScanKey.scankeys[numattrs - 1].sk_strategy =
+			ScanDirectionIsForward(prefixDir) ? BTGreaterStrategyNumber : BTLessStrategyNumber;
+}
+
+/*
+ * Update the scankey for skipping on the 'extra' conditions: opportunities
+ * that arise when we have just skipped to a new prefix and can try to skip
+ * within that prefix to the right tuple by using extra quals when available.
+ *
+ * @todo as an optimization it should be possible to replace calls to this
+ * function and to _bt_skip_update_scankey_for_prefix_skip with more specific
+ * functions that need to do less copying of data.
+ */
+void
+_bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel, ScanDirection curDir,
+									   ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup)
+{
+	BTScanOpaque 	so = (BTScanOpaque) scan->opaque;
+	BTSkip skip = so->skipData;
+	BTScanInsert toCopy;
+	int i, left, lastNonTuple = skip->prefix;
+
+	/* first make sure that currentTupleKey is correct at all times */
+	_bt_skip_update_scankey_for_prefix_skip(scan, indexRel, skip->prefix, itup, prefixDir);
+	/*
+	 * Then do the actual work to set up curPos.skipScanKey. Distinguish
+	 * between work that depends on prefixDir (the attributes between
+	 * attribute number 1 and 'prefix' inclusive) and work that depends on
+	 * curDir (the attributes between attribute number 'prefix' + 1 and
+	 * fwdScanKey.keysz inclusive).
+	 */
+	if (ScanDirectionIsForward(prefixDir))
+	{
+		/*
+		 * If prefixDir is forward, we need to choose between fwdScanKey and
+		 * currentTupleKey. We pick the most restrictive one - in most cases
+		 * this means choosing e.g. a>5 over a=2 when scanning forward -
+		 * unless prioritizeEqual is set, which is done for certain special
+		 * cases.
+		 */
+		for (i = 0; i < skip->prefix; i++)
+		{
+			ScanKey scankey = &skip->fwdScanKey.scankeys[i];
+			ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+			if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) > 0
+										   || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+			{
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+				lastNonTuple = i;
+			}
+			else
+			{
+				if (lastNonTuple < i)
+					break;
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+			}
+			/*
+			 * For now always choose equality here. @todo this could be
+			 * improved a bit by choosing the strategy from the scankeys,
+			 * but it doesn't matter much.
+			 */
+			skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+		}
+	}
+	else
+	{
+		/* the same logic for a backward prefix direction, with the comparison reversed */
+		for (i = 0; i < skip->prefix; i++)
+		{
+			ScanKey scankey = &skip->bwdScanKey.scankeys[i];
+			ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+			if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) < 0
+										   || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+			{
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+				lastNonTuple = i;
+			}
+			else
+			{
+				if (lastNonTuple < i)
+					break;
+				memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+			}
+			skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+		}
+	}
+
+	/*
+	 * the remaining keys are the quals after the prefix
+	 */
+	if (ScanDirectionIsForward(curDir))
+		toCopy = &skip->fwdScanKey;
+	else
+		toCopy = &skip->bwdScanKey;
+
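+	/*
+	 * Append the quals after the prefix only if the scankey still covers the
+	 * full prefix (the last prefix attribute was qual-derived, or tuple
+	 * values were used throughout); otherwise truncate the key just after
+	 * the last qual-derived attribute.
+	 */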
+	if (lastNonTuple >= skip->prefix - 1)
+	{
+		left = toCopy->keysz - skip->prefix;
+		if (left > 0)
+		{
+			memcpy(skip->curPos.skipScanKey.scankeys + skip->prefix, toCopy->scankeys + skip->prefix, sizeof(ScanKeyData) * (unsigned long)left);
+		}
+		skip->curPos.skipScanKey.keysz = toCopy->keysz;
+	}
+	else
+	{
+		skip->curPos.skipScanKey.keysz = lastNonTuple + 1;
+	}
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index efee86784b..58b165866f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -560,7 +560,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
 
 	wstate.heap = btspool->heap;
 	wstate.index = btspool->index;
-	wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+	wstate.inskey = _bt_mkscankey(wstate.index, NULL, NULL);
 	/* _bt_mkscankey() won't set allequalimage without metapage */
 	wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
 	wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 81589b9056..4c14a9e76f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,10 +49,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
 									 ScanKey leftarg, ScanKey rightarg,
 									 bool *result);
 static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
-static void _bt_mark_scankey_required(ScanKey skey);
+static void _bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag);
 static bool _bt_check_rowcompare(ScanKey skey,
 								 IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
-								 ScanDirection dir, bool *continuescan);
+								 ScanDirection dir, bool *continuescan, int *prefixSkipIndex);
 static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
 						   IndexTuple firstright, BTScanInsert itup_key);
 
@@ -87,9 +87,8 @@ static int	_bt_keep_natts(Relation rel, IndexTuple lastleft,
  *		field themselves.
  */
 BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key)
 {
-	BTScanInsert key;
 	ScanKey		skey;
 	TupleDesc	itupdesc;
 	int			indnkeyatts;
@@ -109,8 +108,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	 * Truncated attributes and non-key attributes are omitted from the final
 	 * scan key.
 	 */
-	key = palloc(offsetof(BTScanInsertData, scankeys) +
-				 sizeof(ScanKeyData) * indnkeyatts);
+	if (key == NULL)
+		key = palloc(offsetof(BTScanInsertData, scankeys) +
+					 sizeof(ScanKeyData) * indnkeyatts);
+
 	if (itup)
 		_bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
 	else
@@ -155,7 +156,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
 									   (AttrNumber) (i + 1),
-									   InvalidStrategy,
+									   BTEqualStrategyNumber,
 									   InvalidOid,
 									   rel->rd_indcollation[i],
 									   procinfo,
@@ -745,7 +746,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	int			numberOfKeys = scan->numberOfKeys;
 	int16	   *indoption = scan->indexRelation->rd_indoption;
 	int			new_numberOfKeys;
-	int			numberOfEqualCols;
+	int			numberOfEqualCols, numberOfEqualColsSincePrefix;
 	ScanKey		inkeys;
 	ScanKey		outkeys;
 	ScanKey		cur;
@@ -754,6 +755,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	int			i,
 				j;
 	AttrNumber	attno;
+	int			prefix = 0;
 
 	/* initialize result variables */
 	so->qual_ok = true;
@@ -762,6 +764,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	if (numberOfKeys < 1)
 		return;					/* done if qual-less scan */
 
+	if (_bt_skip_enabled(so))
+		prefix = so->skipData->prefix;
+
 	/*
 	 * Read so->arrayKeyData if array keys are present, else scan->keyData
 	 */
@@ -786,7 +793,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 		so->numberOfKeys = 1;
 		/* We can mark the qual as required if it's for first index col */
 		if (cur->sk_attno == 1)
-			_bt_mark_scankey_required(outkeys);
+			_bt_mark_scankey_required(outkeys, SK_BT_REQFWD, SK_BT_REQBKWD);
+		if (cur->sk_attno <= prefix + 1)
+			_bt_mark_scankey_required(outkeys, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 		return;
 	}
 
@@ -795,6 +804,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 	 */
 	new_numberOfKeys = 0;
 	numberOfEqualCols = 0;
+	numberOfEqualColsSincePrefix = 0;
 
 	/*
 	 * Initialize for processing of keys for attr 1.
@@ -830,6 +841,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 		if (i == numberOfKeys || cur->sk_attno != attno)
 		{
 			int			priorNumberOfEqualCols = numberOfEqualCols;
+			int			priorNumberOfEqualColsSincePrefix = numberOfEqualColsSincePrefix;
 
 			/* check input keys are correctly ordered */
 			if (i < numberOfKeys && cur->sk_attno < attno)
@@ -880,6 +893,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
 				}
 				/* track number of attrs for which we have "=" keys */
 				numberOfEqualCols++;
+				if (attno > prefix)
+					numberOfEqualColsSincePrefix++;
 			}
 
 			/* try to keep only one of <, <= */
@@ -929,7 +944,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 
 					memcpy(outkey, xform[j], sizeof(ScanKeyData));
 					if (priorNumberOfEqualCols == attno - 1)
-						_bt_mark_scankey_required(outkey);
+						_bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+					if (attno <= prefix || priorNumberOfEqualColsSincePrefix == attno - prefix - 1)
+						_bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 				}
 			}
 
@@ -954,7 +971,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 
 			memcpy(outkey, cur, sizeof(ScanKeyData));
 			if (numberOfEqualCols == attno - 1)
-				_bt_mark_scankey_required(outkey);
+				_bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+			if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+				_bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 
 			/*
 			 * We don't support RowCompare using equality; such a qual would
@@ -997,7 +1016,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
 
 				memcpy(outkey, cur, sizeof(ScanKeyData));
 				if (numberOfEqualCols == attno - 1)
-					_bt_mark_scankey_required(outkey);
+					_bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+				if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+					_bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
 			}
 		}
 	}
@@ -1295,7 +1316,7 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
  * anyway on a rescan.  Something to keep an eye on though.
  */
 static void
-_bt_mark_scankey_required(ScanKey skey)
+_bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag)
 {
 	int			addflags;
 
@@ -1303,14 +1324,14 @@ _bt_mark_scankey_required(ScanKey skey)
 	{
 		case BTLessStrategyNumber:
 		case BTLessEqualStrategyNumber:
-			addflags = SK_BT_REQFWD;
+			addflags = forwardReqFlag;
 			break;
 		case BTEqualStrategyNumber:
-			addflags = SK_BT_REQFWD | SK_BT_REQBKWD;
+			addflags = forwardReqFlag | backwardReqFlag;
 			break;
 		case BTGreaterEqualStrategyNumber:
 		case BTGreaterStrategyNumber:
-			addflags = SK_BT_REQBKWD;
+			addflags = backwardReqFlag;
 			break;
 		default:
 			elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -1353,17 +1374,22 @@ _bt_mark_scankey_required(ScanKey skey)
  */
 bool
 _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
-			  ScanDirection dir, bool *continuescan)
+			  ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
 {
 	TupleDesc	tupdesc;
 	BTScanOpaque so;
 	int			keysz;
 	int			ikey;
 	ScanKey		key;
+	int			pfx;
+
+	if (prefixSkipIndex == NULL)
+		prefixSkipIndex = &pfx;
 
 	Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
 
 	*continuescan = true;		/* default assumption */
+	*prefixSkipIndex = -1;
 
 	tupdesc = RelationGetDescr(scan->indexRelation);
 	so = (BTScanOpaque) scan->opaque;
@@ -1392,7 +1418,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 		if (key->sk_flags & SK_ROW_HEADER)
 		{
 			if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
-									 continuescan))
+									 continuescan, prefixSkipIndex))
 				continue;
 			return false;
 		}
@@ -1429,6 +1455,13 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 					 ScanDirectionIsBackward(dir))
 				*continuescan = false;
 
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
@@ -1452,6 +1485,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsBackward(dir))
 					*continuescan = false;
+
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
 			}
 			else
 			{
@@ -1468,6 +1505,9 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsForward(dir))
 					*continuescan = false;
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
 			}
 
 			/*
@@ -1498,6 +1538,206 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
 					 ScanDirectionIsBackward(dir))
 				*continuescan = false;
 
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+
+			/*
+			 * In any case, this indextuple doesn't match the qual.
+			 */
+			return false;
+		}
+	}
+
+	/* If we get here, the tuple passes all index quals. */
+	return true;
+}
+
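+/*
+ * Like _bt_checkkeys, but tests the tuple against the precomputed
+ * insertion-format skip keys using the three-way BTORDER_PROC support
+ * function rather than the operators' boolean sk_func. A forward scan is
+ * checked against bwdScanKey and a backward scan against fwdScanKey, since
+ * those hold the keys that terminate the scan in the given direction.
+ * Entries with sk_attno == 0 (attributes without a qual) are skipped.
+ */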
+bool
+_bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+			  ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
+{
+	TupleDesc	tupdesc;
+	BTScanOpaque so;
+	int			keysz;
+	int			ikey;
+	ScanKey		key;
+	int			pfx;
+	BTScanInsert keys;
+
+	if (prefixSkipIndex == NULL)
+		prefixSkipIndex = &pfx;
+
+	Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+	*continuescan = true;		/* default assumption */
+	*prefixSkipIndex = -1;
+
+	tupdesc = RelationGetDescr(scan->indexRelation);
+	so = (BTScanOpaque) scan->opaque;
+	if (ScanDirectionIsForward(dir))
+		keys = &so->skipData->bwdScanKey;
+	else
+		keys = &so->skipData->fwdScanKey;
+
+	keysz = keys->keysz;
+
+	for (key = keys->scankeys, ikey = 0; ikey < keysz; key++, ikey++)
+	{
+		Datum		datum;
+		bool		isNull;
+		int		cmpresult;
+
+		if (key->sk_attno == 0)
+			continue;
+
+		if (key->sk_attno > tupnatts)
+		{
+			/*
+			 * This attribute is truncated (must be high key).  The value for
+			 * this attribute in the first non-pivot tuple on the page to the
+			 * right could be any possible value.  Assume that truncated
+			 * attribute passes the qual.
+			 */
+			Assert(ScanDirectionIsForward(dir));
+			continue;
+		}
+
+		/* row-comparison keys need special processing */
+		Assert((key->sk_flags & SK_ROW_HEADER) == 0);
+
+		datum = index_getattr(tuple,
+							  key->sk_attno,
+							  tupdesc,
+							  &isNull);
+
+		if (key->sk_flags & SK_ISNULL)
+		{
+			/* Handle IS NULL/NOT NULL tests */
+			if (key->sk_flags & SK_SEARCHNULL)
+			{
+				if (isNull)
+					continue;	/* tuple satisfies this qual */
+			}
+			else
+			{
+				Assert(key->sk_flags & SK_SEARCHNOTNULL);
+				if (!isNull)
+					continue;	/* tuple satisfies this qual */
+			}
+
+			/*
+			 * Tuple fails this qual.  If it's a required qual for the current
+			 * scan direction, then we can conclude no further tuples will
+			 * pass, either.
+			 */
+			if ((key->sk_flags & SK_BT_REQFWD) &&
+				ScanDirectionIsForward(dir))
+				*continuescan = false;
+			else if ((key->sk_flags & SK_BT_REQBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*continuescan = false;
+
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = key->sk_attno - 1;
+
+			/*
+			 * In any case, this indextuple doesn't match the qual.
+			 */
+			return false;
+		}
+
+		if (isNull)
+		{
+			if (key->sk_flags & SK_BT_NULLS_FIRST)
+			{
+				/*
+				 * Since NULLs are sorted before non-NULLs, we know we have
+				 * reached the lower limit of the range of values for this
+				 * index attr.  On a backward scan, we can stop if this qual
+				 * is one of the "must match" subset.  We can stop regardless
+				 * of whether the qual is > or <, so long as it's required,
+				 * because it's not possible for any future tuples to pass. On
+				 * a forward scan, however, we must keep going, because we may
+				 * have initially positioned to the start of the index.
+				 */
+				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*continuescan = false;
+
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
+			}
+			else
+			{
+				/*
+				 * Since NULLs are sorted after non-NULLs, we know we have
+				 * reached the upper limit of the range of values for this
+				 * index attr.  On a forward scan, we can stop if this qual is
+				 * one of the "must match" subset.  We can stop regardless of
+				 * whether the qual is > or <, so long as it's required,
+				 * because it's not possible for any future tuples to pass. On
+				 * a backward scan, however, we must keep going, because we
+				 * may have initially positioned to the end of the index.
+				 */
+				if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+					ScanDirectionIsForward(dir))
+					*continuescan = false;
+				if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = key->sk_attno - 1;
+			}
+
+			/*
+			 * In any case, this indextuple doesn't match the qual.
+			 */
+			return false;
+		}
+
+
+		/* Perform the test --- three-way comparison not bool operator */
+		cmpresult = DatumGetInt32(FunctionCall2Coll(&key->sk_func,
+													key->sk_collation,
+													datum,
+													key->sk_argument));
+
+		if (key->sk_flags & SK_BT_DESC)
+			INVERT_COMPARE_RESULT(cmpresult);
+
+		if (cmpresult != 0)
+		{
+			/*
+			 * Tuple fails this qual.  If it's a required qual for the current
+			 * scan direction, then we can conclude no further tuples will
+			 * pass, either.
+			 *
+			 * Note: because we stop the scan as soon as any required equality
+			 * qual fails, it is critical that equality quals be used for the
+			 * initial positioning in _bt_first() when they are available. See
+			 * comments in _bt_first().
+			 */
+			if ((key->sk_flags & SK_BT_REQFWD) &&
+				ScanDirectionIsForward(dir) && cmpresult > 0)
+				*continuescan = false;
+			else if ((key->sk_flags & SK_BT_REQBKWD) &&
+					 ScanDirectionIsBackward(dir) && cmpresult < 0)
+				*continuescan = false;
+
+			if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir) && cmpresult > 0)
+				*prefixSkipIndex = key->sk_attno - 1;
+			else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir) && cmpresult < 0)
+				*prefixSkipIndex = key->sk_attno - 1;
+
 			/*
 			 * In any case, this indextuple doesn't match the qual.
 			 */
@@ -1520,7 +1760,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
  */
 static bool
 _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
-					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+					 TupleDesc tupdesc, ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
 {
 	ScanKey		subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
 	int32		cmpresult = 0;
@@ -1576,6 +1816,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 				if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsBackward(dir))
 					*continuescan = false;
+
+				if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsBackward(dir))
+					*prefixSkipIndex = subkey->sk_attno - 1;
 			}
 			else
 			{
@@ -1592,6 +1836,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 				if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
 					ScanDirectionIsForward(dir))
 					*continuescan = false;
+
+				if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+					ScanDirectionIsForward(dir))
+					*prefixSkipIndex = subkey->sk_attno - 1;
 			}
 
 			/*
@@ -1616,6 +1864,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 			else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
 					 ScanDirectionIsBackward(dir))
 				*continuescan = false;
+
+			if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+				ScanDirectionIsForward(dir))
+				*prefixSkipIndex = subkey->sk_attno - 1;
+			else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+					 ScanDirectionIsBackward(dir))
+				*prefixSkipIndex = subkey->sk_attno - 1;
 			return false;
 		}
 
@@ -1678,6 +1933,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
 		else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
 				 ScanDirectionIsBackward(dir))
 			*continuescan = false;
+
+		if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+			ScanDirectionIsForward(dir))
+			*prefixSkipIndex = subkey->sk_attno - 1;
+		else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+				 ScanDirectionIsBackward(dir))
+			*prefixSkipIndex = subkey->sk_attno - 1;
 	}
 
 	return result;
@@ -2748,3 +3010,524 @@ _bt_allequalimage(Relation rel, bool debugmessage)
 
 	return allequalimage;
 }
+
+void
+_bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir,
+					  bool *nextkey, bool *goback)
+{
+	/*----------
+	 * Examine the selected initial-positioning strategy to determine exactly
+	 * where we need to start the scan, and set flag variables to control the
+	 * code below.
+	 *
+	 * If nextkey = false, _bt_search and _bt_binsrch will locate the first
+	 * item >= scan key.  If nextkey = true, they will locate the first
+	 * item > scan key.
+	 *
+	 * If goback = true, we will then step back one item, while if
+	 * goback = false, we will start the scan on the located item.
+	 *----------
+	 */
+	switch (stratTotal)
+	{
+		case BTLessStrategyNumber:
+
+			/*
+			 * Find first item >= scankey, then back up one to arrive at last
+			 * item < scankey.  (Note: this positioning strategy is only used
+			 * for a backward scan, so that is always the correct starting
+			 * position.)
+			 */
+			*nextkey = false;
+			*goback = true;
+			break;
+
+		case BTLessEqualStrategyNumber:
+
+			/*
+			 * Find first item > scankey, then back up one to arrive at last
+			 * item <= scankey.  (Note: this positioning strategy is only used
+			 * for a backward scan, so that is always the correct starting
+			 * position.)
+			 */
+			*nextkey = true;
+			*goback = true;
+			break;
+
+		case BTEqualStrategyNumber:
+
+			/*
+			 * If a backward scan was specified, need to start with last equal
+			 * item not first one.
+			 */
+			if (ScanDirectionIsBackward(dir))
+			{
+				/*
+				 * This is the same as the <= strategy.  We will check at the
+				 * end whether the found item is actually =.
+				 */
+				*nextkey = true;
+				*goback = true;
+			}
+			else
+			{
+				/*
+				 * This is the same as the >= strategy.  We will check at the
+				 * end whether the found item is actually =.
+				 */
+				*nextkey = false;
+				*goback = false;
+			}
+			break;
+
+		case BTGreaterEqualStrategyNumber:
+
+			/*
+			 * Find first item >= scankey.  (This is only used for forward
+			 * scans.)
+			 */
+			*nextkey = false;
+			*goback = false;
+			break;
+
+		case BTGreaterStrategyNumber:
+
+			/*
+			 * Find first item > scankey.  (This is only used for forward
+			 * scans.)
+			 */
+			*nextkey = true;
+			*goback = false;
+			break;
+
+		default:
+			/* can't get here, but keep compiler quiet */
+			elog(ERROR, "unrecognized strat_total: %d", (int) stratTotal);
+	}
+}
+
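+/*
+ * Build an insertion scan key from the boundary keys chosen by
+ * _bt_choose_scan_keys. NULL entries in startKeys[] (prefix attributes that
+ * have no qual) become placeholders with sk_attno == 0. Returns false when
+ * the first member of a row-comparison key is NULL, which makes the
+ * condition unmatchable.
+ */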
+bool
+_bt_create_insertion_scan_key(Relation rel, ScanDirection dir,
+							  ScanKey *startKeys, int keysCount,
+							  BTScanInsert inskey, StrategyNumber *stratTotal,
+							  bool *goback)
+{
+	int i;
+	bool nextkey;
+
+	/*
+	 * We want to start the scan somewhere within the index.  Set up an
+	 * insertion scankey we can use to search for the boundary point we
+	 * identified above.  The insertion scankey is built using the keys
+	 * identified by startKeys[].  (Remaining insertion scankey fields are
+	 * initialized after initial-positioning strategy is finalized.)
+	 */
+	Assert(keysCount <= INDEX_MAX_KEYS);
+	for (i = 0; i < keysCount; i++)
+	{
+		ScanKey		cur = startKeys[i];
+
+		if (cur == NULL)
+		{
+			inskey->scankeys[i].sk_attno = 0;
+			continue;
+		}
+
+		Assert(cur->sk_attno == i + 1);
+
+		if (cur->sk_flags & SK_ROW_HEADER)
+		{
+			/*
+			 * Row comparison header: look to the first row member instead.
+			 *
+			 * The member scankeys are already in insertion format (ie, they
+			 * have sk_func = 3-way-comparison function), but we have to watch
+			 * out for nulls, which _bt_preprocess_keys didn't check. A null
+			 * in the first row member makes the condition unmatchable, just
+			 * like qual_ok = false.
+			 */
+			ScanKey		subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+
+			Assert(subkey->sk_flags & SK_ROW_MEMBER);
+			if (subkey->sk_flags & SK_ISNULL)
+			{
+				return false;
+			}
+			memcpy(inskey->scankeys + i, subkey, sizeof(ScanKeyData));
+
+			/*
+			 * If the row comparison is the last positioning key we accepted,
+			 * try to add additional keys from the lower-order row members.
+			 * (If we accepted independent conditions on additional index
+			 * columns, we use those instead --- doesn't seem worth trying to
+			 * determine which is more restrictive.)  Note that this is OK
+			 * even if the row comparison is of ">" or "<" type, because the
+			 * condition applied to all but the last row member is effectively
+			 * ">=" or "<=", and so the extra keys don't break the positioning
+			 * scheme.  But, by the same token, if we aren't able to use all
+			 * the row members, then the part of the row comparison that we
+			 * did use has to be treated as just a ">=" or "<=" condition, and
+			 * so we'd better adjust strat_total accordingly.
+			 */
+			if (i == keysCount - 1)
+			{
+				bool		used_all_subkeys = false;
+
+				Assert(!(subkey->sk_flags & SK_ROW_END));
+				for (;;)
+				{
+					subkey++;
+					Assert(subkey->sk_flags & SK_ROW_MEMBER);
+					if (subkey->sk_attno != keysCount + 1)
+						break;	/* out-of-sequence, can't use it */
+					if (subkey->sk_strategy != cur->sk_strategy)
+						break;	/* wrong direction, can't use it */
+					if (subkey->sk_flags & SK_ISNULL)
+						break;	/* can't use null keys */
+					Assert(keysCount < INDEX_MAX_KEYS);
+					memcpy(inskey->scankeys + keysCount, subkey,
+						   sizeof(ScanKeyData));
+					keysCount++;
+					if (subkey->sk_flags & SK_ROW_END)
+					{
+						used_all_subkeys = true;
+						break;
+					}
+				}
+				if (!used_all_subkeys)
+				{
+					switch (*stratTotal)
+					{
+						case BTLessStrategyNumber:
+							*stratTotal = BTLessEqualStrategyNumber;
+							break;
+						case BTGreaterStrategyNumber:
+							*stratTotal = BTGreaterEqualStrategyNumber;
+							break;
+					}
+				}
+				break;			/* done with outer loop */
+			}
+		}
+		else
+		{
+			/*
+			 * Ordinary comparison key.  Transform the search-style scan key
+			 * to an insertion scan key by replacing the sk_func with the
+			 * appropriate btree comparison function.
+			 *
+			 * If scankey operator is not a cross-type comparison, we can use
+			 * the cached comparison function; otherwise gotta look it up in
+			 * the catalogs.  (That can't lead to infinite recursion, since no
+			 * indexscan initiated by syscache lookup will use cross-data-type
+			 * operators.)
+			 *
+			 * We support the convention that sk_subtype == InvalidOid means
+			 * the opclass input type; this is a hack to simplify life for
+			 * ScanKeyInit().
+			 */
+			if (cur->sk_subtype == rel->rd_opcintype[i] ||
+				cur->sk_subtype == InvalidOid)
+			{
+				FmgrInfo   *procinfo;
+
+				procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
+				ScanKeyEntryInitializeWithInfo(inskey->scankeys + i,
+											   cur->sk_flags,
+											   cur->sk_attno,
+											   cur->sk_strategy,
+											   cur->sk_subtype,
+											   cur->sk_collation,
+											   procinfo,
+											   cur->sk_argument);
+			}
+			else
+			{
+				RegProcedure cmp_proc;
+
+				cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
+											 rel->rd_opcintype[i],
+											 cur->sk_subtype,
+											 BTORDER_PROC);
+				if (!RegProcedureIsValid(cmp_proc))
+					elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
+						 BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
+						 cur->sk_attno, RelationGetRelationName(rel));
+				ScanKeyEntryInitialize(inskey->scankeys + i,
+									   cur->sk_flags,
+									   cur->sk_attno,
+									   cur->sk_strategy,
+									   cur->sk_subtype,
+									   cur->sk_collation,
+									   cmp_proc,
+									   cur->sk_argument);
+			}
+		}
+	}
+
+	_bt_set_bsearch_flags(*stratTotal, dir, &nextkey, goback);
+
+	/* Initialize remaining insertion scan key fields */
+	_bt_metaversion(rel, &inskey->heapkeyspace, &inskey->allequalimage);
+	inskey->anynullkeys = false; /* unused */
+	inskey->nextkey = nextkey;
+	inskey->pivotsearch = false;
+	inskey->scantid = NULL;
+	inskey->keysz = keysCount;
+
+	return true;
+}
+
+/*----------
+ * Examine the scan keys to discover where we need to start the scan.
+ *
+ * We want to identify the keys that can be used as starting boundaries;
+ * these are =, >, or >= keys for a forward scan or =, <, <= keys for
+ * a backwards scan.  We can use keys for multiple attributes so long as
+ * the prior attributes had only =, >= (resp. =, <=) keys.  Once we accept
+ * a > or < boundary or find an attribute with no boundary (which can be
+ * thought of as the same as "> -infinity"), we can't use keys for any
+ * attributes to its right, because it would break our simplistic notion
+ * of what initial positioning strategy to use.
+ *
+ * When the scan keys include cross-type operators, _bt_preprocess_keys
+ * may not be able to eliminate redundant keys; in such cases we will
+ * arbitrarily pick a usable one for each attribute.  This is correct
+ * but possibly not optimal behavior.  (For example, with keys like
+ * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
+ * x=5 would be more efficient.)  Since the situation only arises given
+ * a poorly-worded query plus an incomplete opfamily, live with it.
+ *
+ * When both equality and inequality keys appear for a single attribute
+ * (again, only possible when cross-type operators appear), we *must*
+ * select one of the equality keys for the starting point, because
+ * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
+ * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
+ * start at x=4, we will fail and stop before reaching x=10.  If multiple
+ * equality quals survive preprocessing, however, it doesn't matter which
+ * one we use --- by definition, they are either redundant or
+ * contradictory.
+ *
+ * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
+ * If the index stores nulls at the end of the index we'll be starting
+ * from, and we have no boundary key for the column (which means the key
+ * we deduced NOT NULL from is an inequality key that constrains the other
+ * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
+ * use as a boundary key.  If we didn't do this, we might find ourselves
+ * traversing a lot of null entries at the start of the scan.
+ *
+ * In this loop, row-comparison keys are treated the same as keys on their
+ * first (leftmost) columns.  We'll add on lower-order columns of the row
+ * comparison below, if possible.
+ *
+ * The selected scan keys (at most one per index column) are remembered by
+ * storing their addresses into the local startKeys[] array.
+ *----------
+ */
+int
+_bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir,
+					 ScanKey *startKeys, ScanKeyData *notnullkeys,
+					 StrategyNumber *stratTotal, int prefix)
+{
+	StrategyNumber strat;
+	int			keysCount = 0;
+	int			i;
+
+	*stratTotal = BTEqualStrategyNumber;
+	if (numberOfKeys > 0 || prefix > 0)
+	{
+		AttrNumber	curattr;
+		ScanKey		chosen;
+		ScanKey		impliesNN;
+		ScanKey		cur;
+
+		/*
+		 * chosen is the so-far-chosen key for the current attribute, if any.
+		 * We don't cast the decision in stone until we reach keys for the
+		 * next attribute.
+		 */
+		curattr = 1;
+		chosen = NULL;
+		/* Also remember any scankey that implies a NOT NULL constraint */
+		impliesNN = NULL;
+
+		/*
+		 * Loop iterates from 0 to numberOfKeys inclusive; we use the last
+		 * pass to handle after-last-key processing.  Actual exit from the
+		 * loop is at one of the "break" statements below.
+		 */
+		for (cur = scanKeys, i = 0;; cur++, i++)
+		{
+			if (i >= numberOfKeys || cur->sk_attno != curattr)
+			{
+				/*
+				 * Done looking at keys for curattr.  If we didn't find a
+				 * usable boundary key, see if we can deduce a NOT NULL key.
+				 */
+				if (chosen == NULL && impliesNN != NULL &&
+					((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+					 ScanDirectionIsForward(dir) :
+					 ScanDirectionIsBackward(dir)))
+				{
+					/* Yes, so build the key in notnullkeys[keysCount] */
+					chosen = &notnullkeys[keysCount];
+					ScanKeyEntryInitialize(chosen,
+										   (SK_SEARCHNOTNULL | SK_ISNULL |
+											(impliesNN->sk_flags &
+											 (SK_BT_DESC | SK_BT_NULLS_FIRST))),
+										   curattr,
+										   ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+											BTGreaterStrategyNumber :
+											BTLessStrategyNumber),
+										   InvalidOid,
+										   InvalidOid,
+										   InvalidOid,
+										   (Datum) 0);
+				}
+
+				/*
+				 * If we still didn't find a usable boundary key, quit; else
+				 * save the boundary key pointer in startKeys.
+				 */
+				if (chosen == NULL && curattr > prefix)
+					break;
+				startKeys[keysCount++] = chosen;
+
+				/*
+				 * Adjust strat_total, and quit if we have stored a > or <
+				 * key.
+				 */
+				if (chosen != NULL && curattr > prefix)
+				{
+					strat = chosen->sk_strategy;
+					if (strat != BTEqualStrategyNumber)
+					{
+						*stratTotal = strat;
+						if (strat == BTGreaterStrategyNumber ||
+							strat == BTLessStrategyNumber)
+							break;
+					}
+				}
+
+				/*
+				 * Done if that was the last attribute, or if next key is not
+				 * in sequence (implying no boundary key is available for the
+				 * next attribute).
+				 */
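+				/*
+				 * Within the skip prefix, attributes without any boundary
+				 * key are allowed: record a NULL placeholder, which
+				 * _bt_create_insertion_scan_key turns into an sk_attno == 0
+				 * entry that the skip code can later fill from the current
+				 * tuple.
+				 */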
+				if (i >= numberOfKeys)
+				{
+					curattr++;
+					while (curattr <= prefix)
+					{
+						startKeys[keysCount++] = NULL;
+						curattr++;
+					}
+					break;
+				}
+				else if (cur->sk_attno != curattr + 1)
+				{
+					curattr++;
+					while (curattr < cur->sk_attno && curattr <= prefix)
+					{
+						startKeys[keysCount++] = NULL;
+						curattr++;
+					}
+					if (curattr > prefix && curattr != cur->sk_attno)
+						break;
+				}
+				else
+				{
+					curattr++;
+				}
+
+				/*
+				 * Reset for next attr.
+				 */
+				chosen = NULL;
+				impliesNN = NULL;
+			}
+
+			/*
+			 * Can we use this key as a starting boundary for this attr?
+			 *
+			 * If not, does it imply a NOT NULL constraint?  (Because
+			 * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
+			 * *any* inequality key works for that; we need not test.)
+			 */
+			switch (cur->sk_strategy)
+			{
+				case BTLessStrategyNumber:
+				case BTLessEqualStrategyNumber:
+					if (chosen == NULL)
+					{
+						if (ScanDirectionIsBackward(dir))
+							chosen = cur;
+						else
+							impliesNN = cur;
+					}
+					break;
+				case BTEqualStrategyNumber:
+					/* override any non-equality choice */
+					chosen = cur;
+					break;
+				case BTGreaterEqualStrategyNumber:
+				case BTGreaterStrategyNumber:
+					if (chosen == NULL)
+					{
+						if (ScanDirectionIsForward(dir))
+							chosen = cur;
+						else
+							impliesNN = cur;
+					}
+					break;
+			}
+		}
+	}
+	return keysCount;
+}
+
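+/*
+ * Debugging aid: log the key values of one or two index tuples at DEBUG1.
+ * The tuple descriptor is temporarily adjusted to the number of key columns
+ * actually present in the caller's tuples; catalog indexes are skipped to
+ * avoid infinite recursion.
+ */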
+void
+print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel,
+		   char *extra)
+{
+	bool		isnull[INDEX_MAX_KEYS];
+	Datum		values[INDEX_MAX_KEYS];
+	char	   *lkey_desc = NULL;
+	char	   *rkey_desc;
+
+	/* Avoid infinite recursion -- don't instrument catalog indexes */
+	if (!IsCatalogRelation(rel))
+	{
+		TupleDesc	itupdesc = RelationGetDescr(rel);
+		int			natts;
+		int			indnkeyatts = rel->rd_index->indnkeyatts;
+
+		natts = BTreeTupleGetNAtts(left, rel);
+		itupdesc->natts = Min(indnkeyatts, natts);
+		memset(&isnull, 0xFF, sizeof(isnull));
+		index_deform_tuple(left, itupdesc, values, isnull);
+		rel->rd_index->indnkeyatts = natts;
+
+		/*
+		 * Since the regression tests should pass when the instrumentation
+		 * patch is applied, be prepared for BuildIndexValueDescription() to
+		 * return NULL due to security considerations.
+		 */
+		lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+		if (lkey_desc && right)
+		{
+			/*
+			 * Revolting hack: modify tuple descriptor to have number of key
+			 * columns actually present in caller's pivot tuples
+			 */
+			natts = BTreeTupleGetNAtts(right, rel);
+			itupdesc->natts = Min(indnkeyatts, natts);
+			memset(&isnull, 0xFF, sizeof(isnull));
+			index_deform_tuple(right, itupdesc, values, isnull);
+			rel->rd_index->indnkeyatts = natts;
+			rkey_desc = BuildIndexValueDescription(rel, values, isnull);
+			elog(DEBUG1, "%s blk %u sk > %s, sk <= %s %s",
+				 RelationGetRelationName(rel), blk, lkey_desc, rkey_desc,
+				 extra);
+			pfree(rkey_desc);
+		}
+		else
+			elog(DEBUG1, "%s blk %u sk check %s %s",
+				 RelationGetRelationName(rel), blk,
+				 lkey_desc ? lkey_desc : "(not available)", extra);
+
+		/* Cleanup */
+		itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+		rel->rd_index->indnkeyatts = indnkeyatts;
+		if (lkey_desc)
+			pfree(lkey_desc);
+	}
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 0efe05e552..54ffc05769 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -69,6 +69,9 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
+	amroutine->amskip = NULL;
+	amroutine->ambeginskipscan = NULL;
+	amroutine->amgetskiptuple = NULL;
 	amroutine->amcostestimate = spgcostestimate;
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a283e4d45c..389eb1528c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -148,6 +148,7 @@ static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
 static void ExplainIndentText(ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
 static void escape_yaml(StringInfo buf, const char *str);
 
 
@@ -1096,6 +1097,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
 	return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
 }
 
+/*
+ * ExplainIndexSkipScanKeys -
+ *	  Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+	if (skipPrefixSize > 0)
+	{
+		if (es->format != EXPLAIN_FORMAT_TEXT)
+			ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+	}
+}
+
 /*
  * ExplainNode -
  *	  Appends a description of a plan tree to es->str
@@ -1433,6 +1450,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			{
 				IndexScan  *indexscan = (IndexScan *) plan;
 
+				if (indexscan->indexdistinct)
+					ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+
 				ExplainIndexScanDetails(indexscan->indexid,
 										indexscan->indexorderdir,
 										es);
@@ -1443,6 +1463,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			{
 				IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
 
+				if (indexonlyscan->indexdistinct)
+					ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+
 				ExplainIndexScanDetails(indexonlyscan->indexid,
 										indexonlyscan->indexorderdir,
 										es);
@@ -1703,6 +1726,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	switch (nodeTag(plan))
 	{
 		case T_IndexScan:
+			if (((IndexScan *) plan)->indexskipprefixsize > 0)
+				ExplainPropertyText("Skip scan", ((IndexScan *) plan)->indexdistinct ? "Distinct only" : "All", es);
 			show_scan_qual(((IndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
 			if (((IndexScan *) plan)->indexqualorig)
@@ -1716,6 +1741,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 										   planstate, es);
 			break;
 		case T_IndexOnlyScan:
+			if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+				ExplainPropertyText("Skip scan", ((IndexOnlyScan *) plan)->indexdistinct ? "Distinct only" : "All", es);
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
 						   "Index Cond", planstate, ancestors, es);
 			if (((IndexOnlyScan *) plan)->indexqual)
@@ -1732,6 +1759,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 									 planstate->instrument->ntuples2, 0, es);
 			break;
 		case T_BitmapIndexScan:
+			if (((BitmapIndexScan *) plan)->indexskipprefixsize > 0)
+				ExplainPropertyText("Skip scan", "All", es);
 			show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
 						   "Index Cond", planstate, ancestors, es);
 			break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 642805d90c..0e77f241f9 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -133,6 +133,14 @@ ExecScanFetch(ScanState *node,
 	return (*accessMtd) (node);
 }
 
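+/*
+ * ExecScan is kept as a thin wrapper around ExecScanExtended for callers
+ * that have no skip method.
+ */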
+TupleTableSlot *
+ExecScan(ScanState *node,
+		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
+		 ExecScanRecheckMtd recheckMtd)
+{
+	return ExecScanExtended(node, accessMtd, recheckMtd, NULL);
+}
+
 /* ----------------------------------------------------------------
- *		ExecScan
+ *		ExecScanExtended
  *
@@ -155,9 +163,10 @@ ExecScanFetch(ScanState *node,
  * ----------------------------------------------------------------
  */
 TupleTableSlot *
-ExecScan(ScanState *node,
+ExecScanExtended(ScanState *node,
 		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
-		 ExecScanRecheckMtd recheckMtd)
+		 ExecScanRecheckMtd recheckMtd,
+		 ExecScanSkipMtd skipMtd)
 {
 	ExprContext *econtext;
 	ExprState  *qual;
@@ -170,6 +179,20 @@ ExecScan(ScanState *node,
 	projInfo = node->ps.ps_ProjInfo;
 	econtext = node->ps.ps_ExprContext;
 
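+	/*
+	 * If a skip method is given and we already emitted the first tuple, let
+	 * it reposition the scan before fetching; when it reports that no
+	 * further prefix exists, the scan ends here.
+	 */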
+	if (skipMtd != NULL && node->ss_FirstTupleEmitted)
+	{
+		bool cont = skipMtd(node);
+		if (!cont)
+		{
+			node->ss_FirstTupleEmitted = false;
+			return ExecClearTuple(node->ss_ScanTupleSlot);
+		}
+	}
+	else
+	{
+		node->ss_FirstTupleEmitted = true;
+	}
+
 	/* interrupt checks are in ExecScanFetch */
 
 	/*
@@ -178,8 +201,13 @@ ExecScan(ScanState *node,
 	 */
 	if (!qual && !projInfo)
 	{
+		TupleTableSlot *slot;
+
 		ResetExprContext(econtext);
-		return ExecScanFetch(node, accessMtd, recheckMtd);
+		slot = ExecScanFetch(node, accessMtd, recheckMtd);
+		if (TupIsNull(slot))
+			node->ss_FirstTupleEmitted = false;
+		return slot;
 	}
 
 	/*
@@ -206,6 +234,7 @@ ExecScan(ScanState *node,
 		 */
 		if (TupIsNull(slot))
 		{
+			node->ss_FirstTupleEmitted = false;
 			if (projInfo)
 				return ExecClearTuple(projInfo->pi_state.resultslot);
 			else
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 81a1208157..602c64fc91 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -22,13 +22,14 @@
 #include "postgres.h"
 
 #include "access/genam.h"
+#include "access/relscan.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapIndexscan.h"
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
+#include "utils/rel.h"
 #include "utils/memutils.h"
 
-
 /* ----------------------------------------------------------------
  *		ExecBitmapIndexScan
  *
@@ -308,10 +309,20 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
 	/*
 	 * Initialize scan descriptor.
 	 */
-	indexstate->biss_ScanDesc =
-		index_beginscan_bitmap(indexstate->biss_RelationDesc,
-							   estate->es_snapshot,
-							   indexstate->biss_NumScanKeys);
+	if (node->indexskipprefixsize > 0)
+	{
+		indexstate->biss_ScanDesc =
+			index_beginscan_bitmap_skip(indexstate->biss_RelationDesc,
+				estate->es_snapshot,
+				indexstate->biss_NumScanKeys,
+				Min(IndexRelationGetNumberOfKeyAttributes(indexstate->biss_RelationDesc),
+					node->indexskipprefixsize));
+	}
+	else
+		indexstate->biss_ScanDesc =
+			index_beginscan_bitmap(indexstate->biss_RelationDesc,
+								   estate->es_snapshot,
+								   indexstate->biss_NumScanKeys);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 5617ac29e7..aadba4a0fe 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -41,6 +41,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "storage/itemptr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -49,6 +50,37 @@ static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
 static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
 							TupleDesc itupdesc);
 
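+/*
+ * Skip method used by ExecScanExtended: advance the index-only scan to the
+ * next distinct prefix via index_skip. Returns false when no further prefix
+ * exists, which ends the scan.
+ */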
+static bool
+IndexOnlySkip(IndexOnlyScanState *node)
+{
+	EState	   *estate;
+	ScanDirection direction;
+	IndexScanDesc scandesc;
+	IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+	if (!node->ioss_Distinct)
+		return true;
+
+	/*
+	 * extract necessary information from index scan node
+	 */
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	/* flip direction if this is an overall backward scan */
+	if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
+	{
+		if (ScanDirectionIsForward(direction))
+			direction = BackwardScanDirection;
+		else if (ScanDirectionIsBackward(direction))
+			direction = ForwardScanDirection;
+	}
+	scandesc = node->ioss_ScanDesc;
+
+	if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir))
+		return false;
+
+	return true;
+}
 
 /* ----------------------------------------------------------------
  *		IndexOnlyNext
@@ -65,6 +97,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
 	ItemPointer tid;
+	ItemPointerData startTid;
+	IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
 
 	/*
 	 * extract necessary information from index scan node
@@ -72,7 +106,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
 	estate = node->ss.ps.state;
 	direction = estate->es_direction;
 	/* flip direction if this is an overall backward scan */
-	if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+	if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
 	{
 		if (ScanDirectionIsForward(direction))
 			direction = BackwardScanDirection;
@@ -90,11 +124,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
 		 * serially executing an index only scan that was planned to be
 		 * parallel.
 		 */
-		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->ioss_RelationDesc,
-								   estate->es_snapshot,
-								   node->ioss_NumScanKeys,
-								   node->ioss_NumOrderByKeys);
+		if (node->ioss_SkipPrefixSize > 0)
+			scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+									   node->ioss_RelationDesc,
+									   estate->es_snapshot,
+									   node->ioss_NumScanKeys,
+									   node->ioss_NumOrderByKeys,
+									   Min(IndexRelationGetNumberOfKeyAttributes(node->ioss_RelationDesc), node->ioss_SkipPrefixSize));
+		else
+			scandesc = index_beginscan(node->ss.ss_currentRelation,
+									   node->ioss_RelationDesc,
+									   estate->es_snapshot,
+									   node->ioss_NumScanKeys,
+									   node->ioss_NumOrderByKeys);
 
 		node->ioss_ScanDesc = scandesc;
 
@@ -114,11 +156,16 @@ IndexOnlyNext(IndexOnlyScanState *node)
 						 node->ioss_OrderByKeys,
 						 node->ioss_NumOrderByKeys);
 	}
+	else
+	{
+		ItemPointerCopy(&scandesc->xs_heaptid, &startTid);
+	}
 
 	/*
 	 * OK, now that we have what we need, fetch the next tuple.
 	 */
-	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+	while ((tid = node->ioss_SkipPrefixSize > 0 ?
+			index_getnext_tid_skip(scandesc, direction,
+								   node->ioss_Distinct ? indexonlyscan->indexorderdir : direction) :
+			index_getnext_tid(scandesc, direction)) != NULL)
 	{
 		bool		tuple_from_heap = false;
 
@@ -314,9 +361,10 @@ ExecIndexOnlyScan(PlanState *pstate)
 	if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
 		ExecReScan((PlanState *) node);
 
-	return ExecScan(&node->ss,
+	return ExecScanExtended(&node->ss,
 					(ExecScanAccessMtd) IndexOnlyNext,
-					(ExecScanRecheckMtd) IndexOnlyRecheck);
+					(ExecScanRecheckMtd) IndexOnlyRecheck,
+					node->ioss_Distinct ? (ExecScanSkipMtd) IndexOnlySkip : NULL);
 }
 
 /* ----------------------------------------------------------------
@@ -503,7 +551,10 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate = makeNode(IndexOnlyScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ss_FirstTupleEmitted = false;
 	indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+	indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+	indexstate->ioss_Distinct = node->indexdistinct;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index d0a96a38e0..db3b5a3379 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -69,6 +69,37 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
 							  Datum *orderbyvals, bool *orderbynulls);
 static HeapTuple reorderqueue_pop(IndexScanState *node);
 
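+/*
+ * Skip method used by ExecScanExtended, the plain index scan counterpart of
+ * IndexOnlySkip: advance the scan to the next distinct prefix via
+ * index_skip, returning false once the scan is exhausted.
+ */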
+static bool
+IndexSkip(IndexScanState *node)
+{
+	EState	   *estate;
+	ScanDirection direction;
+	IndexScanDesc scandesc;
+	IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+	if (!node->iss_Distinct)
+		return true;
+
+	/*
+	 * extract necessary information from index scan node
+	 */
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	/* flip direction if this is an overall backward scan */
+	if (ScanDirectionIsBackward(indexscan->indexorderdir))
+	{
+		if (ScanDirectionIsForward(direction))
+			direction = BackwardScanDirection;
+		else if (ScanDirectionIsBackward(direction))
+			direction = ForwardScanDirection;
+	}
+	scandesc = node->iss_ScanDesc;
+
+	if (!index_skip(scandesc, direction, indexscan->indexorderdir))
+		return false;
+
+	return true;
+}
 
 /* ----------------------------------------------------------------
  *		IndexNext
@@ -85,6 +116,7 @@ IndexNext(IndexScanState *node)
 	ScanDirection direction;
 	IndexScanDesc scandesc;
 	TupleTableSlot *slot;
+	IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
 
 	/*
 	 * extract necessary information from index scan node
@@ -92,7 +124,7 @@ IndexNext(IndexScanState *node)
 	estate = node->ss.ps.state;
 	direction = estate->es_direction;
 	/* flip direction if this is an overall backward scan */
-	if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+	if (ScanDirectionIsBackward(indexscan->indexorderdir))
 	{
 		if (ScanDirectionIsForward(direction))
 			direction = BackwardScanDirection;
@@ -109,14 +141,25 @@ IndexNext(IndexScanState *node)
 		 * We reach here if the index scan is not parallel, or if we're
 		 * serially executing an index scan that was planned to be parallel.
 		 */
-		scandesc = index_beginscan(node->ss.ss_currentRelation,
-								   node->iss_RelationDesc,
-								   estate->es_snapshot,
-								   node->iss_NumScanKeys,
-								   node->iss_NumOrderByKeys);
+		if (node->iss_SkipPrefixSize > 0)
+			scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+									   node->iss_RelationDesc,
+									   estate->es_snapshot,
+									   node->iss_NumScanKeys,
+									   node->iss_NumOrderByKeys,
+									   Min(IndexRelationGetNumberOfKeyAttributes(node->iss_RelationDesc), node->iss_SkipPrefixSize));
+		else
+			scandesc = index_beginscan(node->ss.ss_currentRelation,
+									   node->iss_RelationDesc,
+									   estate->es_snapshot,
+									   node->iss_NumScanKeys,
+									   node->iss_NumOrderByKeys);
 
 		node->iss_ScanDesc = scandesc;
 
+		/* Index skip scan assumes xs_want_itup, so set it to true when skipping over distinct prefixes */
+		node->iss_ScanDesc->xs_want_itup = indexscan->indexdistinct;
+
 		/*
 		 * If no run-time keys to calculate or they are ready, go ahead and
 		 * pass the scankeys to the index AM.
@@ -130,7 +173,9 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
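+	/*
+	 * For skip scans the prefix may advance in a different direction than
+	 * the tuples within a prefix (e.g. a backward cursor fetch over a
+	 * forward DISTINCT scan), so pass both directions to the skip-aware
+	 * variant.
+	 */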
+	while (node->iss_SkipPrefixSize > 0 ?
+		   index_getnext_slot_skip(scandesc, direction, node->iss_Distinct ? indexscan->indexorderdir : direction, slot) :
+		   index_getnext_slot(scandesc, direction, slot))
 	{
 		CHECK_FOR_INTERRUPTS();
 
@@ -530,13 +575,15 @@ ExecIndexScan(PlanState *pstate)
 		ExecReScan((PlanState *) node);
 
 	if (node->iss_NumOrderByKeys > 0)
-		return ExecScan(&node->ss,
+		return ExecScanExtended(&node->ss,
 						(ExecScanAccessMtd) IndexNextWithReorder,
-						(ExecScanRecheckMtd) IndexRecheck);
+						(ExecScanRecheckMtd) IndexRecheck,
+						node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
 	else
-		return ExecScan(&node->ss,
+		return ExecScanExtended(&node->ss,
 						(ExecScanAccessMtd) IndexNext,
-						(ExecScanRecheckMtd) IndexRecheck);
+						(ExecScanRecheckMtd) IndexRecheck,
+						node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
 }
 
 /* ----------------------------------------------------------------
@@ -910,6 +957,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
 	indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+	indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+	indexstate->iss_Distinct = node->indexdistinct;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1f50400fd2..6d67efe54e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -493,6 +493,8 @@ _copyIndexScan(const IndexScan *from)
 	COPY_NODE_FIELD(indexorderbyorig);
 	COPY_NODE_FIELD(indexorderbyops);
 	COPY_SCALAR_FIELD(indexorderdir);
+	COPY_SCALAR_FIELD(indexskipprefixsize);
+	COPY_SCALAR_FIELD(indexdistinct);
 
 	return newnode;
 }
@@ -518,6 +520,8 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
 	COPY_NODE_FIELD(indexorderby);
 	COPY_NODE_FIELD(indextlist);
 	COPY_SCALAR_FIELD(indexorderdir);
+	COPY_SCALAR_FIELD(indexskipprefixsize);
+	COPY_SCALAR_FIELD(indexdistinct);
 
 	return newnode;
 }
@@ -542,6 +546,7 @@ _copyBitmapIndexScan(const BitmapIndexScan *from)
 	COPY_SCALAR_FIELD(isshared);
 	COPY_NODE_FIELD(indexqual);
 	COPY_NODE_FIELD(indexqualorig);
+	COPY_SCALAR_FIELD(indexskipprefixsize);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c3a9632992..5f395547f6 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -562,6 +562,8 @@ _outIndexScan(StringInfo str, const IndexScan *node)
 	WRITE_NODE_FIELD(indexorderbyorig);
 	WRITE_NODE_FIELD(indexorderbyops);
 	WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+	WRITE_INT_FIELD(indexskipprefixsize);
+	WRITE_INT_FIELD(indexdistinct);
 }
 
 static void
@@ -576,6 +578,8 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
 	WRITE_NODE_FIELD(indexorderby);
 	WRITE_NODE_FIELD(indextlist);
 	WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+	WRITE_INT_FIELD(indexskipprefixsize);
+	WRITE_INT_FIELD(indexdistinct);
 }
 
 static void
@@ -589,6 +594,7 @@ _outBitmapIndexScan(StringInfo str, const BitmapIndexScan *node)
 	WRITE_BOOL_FIELD(isshared);
 	WRITE_NODE_FIELD(indexqual);
 	WRITE_NODE_FIELD(indexqualorig);
+	WRITE_INT_FIELD(indexskipprefixsize);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3a18571d0c..d0bcf8a391 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1829,6 +1829,8 @@ _readIndexScan(void)
 	READ_NODE_FIELD(indexorderbyorig);
 	READ_NODE_FIELD(indexorderbyops);
 	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+	READ_INT_FIELD(indexskipprefixsize);
+	READ_INT_FIELD(indexdistinct);
 
 	READ_DONE();
 }
@@ -1848,6 +1850,8 @@ _readIndexOnlyScan(void)
 	READ_NODE_FIELD(indexorderby);
 	READ_NODE_FIELD(indextlist);
 	READ_ENUM_FIELD(indexorderdir, ScanDirection);
+	READ_INT_FIELD(indexskipprefixsize);
+	READ_INT_FIELD(indexdistinct);
 
 	READ_DONE();
 }
@@ -1866,6 +1870,7 @@ _readBitmapIndexScan(void)
 	READ_BOOL_FIELD(isshared);
 	READ_NODE_FIELD(indexqual);
 	READ_NODE_FIELD(indexqualorig);
+	READ_INT_FIELD(indexskipprefixsize);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 945aa93374..fe6ef62e8f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -125,6 +125,7 @@ int			max_parallel_workers_per_gather = 2;
 bool		enable_seqscan = true;
 bool		enable_indexscan = true;
 bool		enable_indexonlyscan = true;
+bool		enable_indexskipscan = true;
 bool		enable_bitmapscan = true;
 bool		enable_tidscan = true;
 bool		enable_sort = true;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index bcb1bc6097..216dd4611e 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -780,6 +780,16 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	{
 		IndexPath  *ipath = (IndexPath *) lfirst(lc);
 
+		/*
+		 * To prevent unique paths from index skip scans from being used
+		 * when they are not needed, keep them in a separate pathlist.
+		 */
+		if (ipath->indexdistinct)
+		{
+			add_unique_path(rel, (Path *) ipath);
+			continue;
+		}
+
 		if (index->amhasgettuple)
 			add_path(rel, (Path *) ipath);
 
@@ -868,6 +878,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	bool		pathkeys_possibly_useful;
 	bool		index_is_ordered;
 	bool		index_only_scan;
+	bool		can_skip;
 	int			indexcol;
 
 	/*
@@ -1017,6 +1028,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	index_only_scan = (scantype != ST_BITMAPSCAN &&
 					   check_index_only(rel, index));
 
+	/* Check if an index skip scan is possible. */
+	can_skip = enable_indexskipscan && index->amcanskip;
+
 	/*
 	 * 4. Generate an indexscan path if there are relevant restriction clauses
 	 * in the current clauses, OR the index ordering is potentially useful for
@@ -1040,6 +1054,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  false);
 		result = lappend(result, ipath);
 
+		/* Consider index skip scan as well */
+		if (root->query_uniquekeys != NULL && can_skip)
+		{
+			int numusefulkeys = list_length(useful_pathkeys);
+			int numsortkeys = list_length(root->query_pathkeys);
+
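+			/* only consider skipping if the index satisfies the full requested ordering */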
+			if (numusefulkeys == numsortkeys)
+			{
+				int prefix;
+				if (list_length(root->distinct_pathkeys) > 0)
+					prefix = find_index_prefix_for_pathkey(root,
+														   index,
+														   ForwardScanDirection,
+														   llast_node(PathKey,
+														   root->distinct_pathkeys));
+				else
+					/* all distinct keys are constant and optimized away;
+					 * a prefix of 1 is sufficient, as all keys are constant
+					 */
+					prefix = 1;
+
+				result = lappend(result,
+								 create_skipscan_unique_path(root, index,
+															 (Path *) ipath, prefix));
+			}
+		}
+
 		/*
 		 * If appropriate, consider parallel index scan.  We don't allow
 		 * parallel index scan for bitmap index scans.
@@ -1095,6 +1136,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  false);
 			result = lappend(result, ipath);
 
+			/* Consider index skip scan as well */
+			if (root->query_uniquekeys != NULL && can_skip)
+			{
+				int numusefulkeys = list_length(useful_pathkeys);
+				int numsortkeys = list_length(root->query_pathkeys);
+
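+				/* only consider skipping if the index satisfies the full requested ordering */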
+				if (numusefulkeys == numsortkeys)
+				{
+					int prefix;
+					if (list_length(root->distinct_pathkeys) > 0)
+						prefix = find_index_prefix_for_pathkey(root,
+															   index,
+															   BackwardScanDirection,
+															   llast_node(PathKey,
+															   root->distinct_pathkeys));
+					else
+						/* all distinct keys are constant and optimized away;
+						 * a prefix of 1 is sufficient, as all keys are constant
+						 */
+						prefix = 1;
+
+					result = lappend(result,
+									 create_skipscan_unique_path(root, index,
+																 (Path *) ipath, prefix));
+				}
+			}
+
 			/* If appropriate, consider parallel index scan */
 			if (index->amcanparallel &&
 				rel->consider_parallel && outer_relids == NULL &&
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index a4fc4f252d..3fa533be95 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -522,6 +522,78 @@ get_cheapest_parallel_safe_total_inner(List *paths)
  *		NEW PATHKEY FORMATION
  ****************************************************************************/
 
+/*
+ * Find the prefix size for a specific path key in an index.
+ * For example, for an index on (a,b,c), looking up path key b
+ * returns prefix 2.
+ * Returns 0 when the path key is not found.
+ */
+int
+find_index_prefix_for_pathkey(PlannerInfo *root,
+					 IndexOptInfo *index,
+					 ScanDirection scandir,
+					 PathKey *pathkey)
+{
+	ListCell   *lc;
+	int			i;
+
+	i = 0;
+	foreach(lc, index->indextlist)
+	{
+		TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+		Expr	   *indexkey;
+		bool		reverse_sort;
+		bool		nulls_first;
+		PathKey    *cpathkey;
+
+		/*
+		 * INCLUDE columns are stored in index unordered, so they don't
+		 * support ordered index scan.
+		 */
+		if (i >= index->nkeycolumns)
+			break;
+
+		/* We assume we don't need to make a copy of the tlist item */
+		indexkey = indextle->expr;
+
+		if (ScanDirectionIsBackward(scandir))
+		{
+			reverse_sort = !index->reverse_sort[i];
+			nulls_first = !index->nulls_first[i];
+		}
+		else
+		{
+			reverse_sort = index->reverse_sort[i];
+			nulls_first = index->nulls_first[i];
+		}
+
+		/*
+		 * OK, try to make a canonical pathkey for this sort key.  Note we're
+		 * underneath any outer joins, so nullable_relids should be NULL.
+		 */
+		cpathkey = make_pathkey_from_sortinfo(root,
+											  indexkey,
+											  NULL,
+											  index->sortopfamily[i],
+											  index->opcintype[i],
+											  index->indexcollations[i],
+											  reverse_sort,
+											  nulls_first,
+											  0,
+											  index->rel->relids,
+											  false);
+
+		if (cpathkey == pathkey)
+		{
+			return i + 1;
+		}
+
+		i++;
+	}
+
+	return 0;
+}
+
 /*
  * build_index_pathkeys
  *	  Build a pathkeys list that describes the ordering induced by an index
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 99278eed93..f6caf8b3e1 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -177,15 +177,20 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
 								 Oid indexid, List *indexqual, List *indexqualorig,
 								 List *indexorderby, List *indexorderbyorig,
 								 List *indexorderbyops,
-								 ScanDirection indexscandir);
+								 ScanDirection indexscandir,
+								 int skipprefix,
+								 bool distinct);
 static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
 										 Index scanrelid, Oid indexid,
 										 List *indexqual, List *indexorderby,
 										 List *indextlist,
-										 ScanDirection indexscandir);
+										 ScanDirection indexscandir,
+										 int skipprefix,
+										 bool distinct);
 static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
 											  List *indexqual,
-											  List *indexqualorig);
+											  List *indexqualorig,
+											  int skipPrefixSize);
 static BitmapHeapScan *make_bitmap_heapscan(List *qptlist,
 											List *qpqual,
 											Plan *lefttree,
@@ -2987,7 +2992,9 @@ create_indexscan_plan(PlannerInfo *root,
 												fixed_indexquals,
 												fixed_indexorderbys,
 												best_path->indexinfo->indextlist,
-												best_path->indexscandir);
+												best_path->indexscandir,
+												best_path->indexskipprefix,
+												best_path->indexdistinct);
 	else
 		scan_plan = (Scan *) make_indexscan(tlist,
 											qpqual,
@@ -2998,7 +3005,9 @@ create_indexscan_plan(PlannerInfo *root,
 											fixed_indexorderbys,
 											indexorderbys,
 											indexorderbyops,
-											best_path->indexscandir);
+											best_path->indexscandir,
+											best_path->indexskipprefix,
+											best_path->indexdistinct);
 
 	copy_generic_path_info(&scan_plan->plan, &best_path->path);
 
@@ -3288,7 +3297,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
 		plan = (Plan *) make_bitmap_indexscan(iscan->scan.scanrelid,
 											  iscan->indexid,
 											  iscan->indexqual,
-											  iscan->indexqualorig);
+											  iscan->indexqualorig,
+											  iscan->indexskipprefixsize);
 		/* and set its cost/width fields appropriately */
 		plan->startup_cost = 0.0;
 		plan->total_cost = ipath->indextotalcost;
@@ -5265,7 +5275,9 @@ make_indexscan(List *qptlist,
 			   List *indexorderby,
 			   List *indexorderbyorig,
 			   List *indexorderbyops,
-			   ScanDirection indexscandir)
+			   ScanDirection indexscandir,
+			   int skipPrefixSize,
+			   bool distinct)
 {
 	IndexScan  *node = makeNode(IndexScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5282,6 +5294,8 @@ make_indexscan(List *qptlist,
 	node->indexorderbyorig = indexorderbyorig;
 	node->indexorderbyops = indexorderbyops;
 	node->indexorderdir = indexscandir;
+	node->indexskipprefixsize = skipPrefixSize;
+	node->indexdistinct = distinct;
 
 	return node;
 }
@@ -5294,7 +5308,9 @@ make_indexonlyscan(List *qptlist,
 				   List *indexqual,
 				   List *indexorderby,
 				   List *indextlist,
-				   ScanDirection indexscandir)
+				   ScanDirection indexscandir,
+				   int skipPrefixSize,
+				   bool distinct)
 {
 	IndexOnlyScan *node = makeNode(IndexOnlyScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5309,6 +5325,8 @@ make_indexonlyscan(List *qptlist,
 	node->indexorderby = indexorderby;
 	node->indextlist = indextlist;
 	node->indexorderdir = indexscandir;
+	node->indexskipprefixsize = skipPrefixSize;
+	node->indexdistinct = distinct;
 
 	return node;
 }
@@ -5317,7 +5335,8 @@ static BitmapIndexScan *
 make_bitmap_indexscan(Index scanrelid,
 					  Oid indexid,
 					  List *indexqual,
-					  List *indexqualorig)
+					  List *indexqualorig,
+					  int skipPrefixSize)
 {
 	BitmapIndexScan *node = makeNode(BitmapIndexScan);
 	Plan	   *plan = &node->scan.plan;
@@ -5330,6 +5349,7 @@ make_bitmap_indexscan(Index scanrelid,
 	node->indexid = indexid;
 	node->indexqual = indexqual;
 	node->indexqualorig = indexqualorig;
+	node->indexskipprefixsize = skipPrefixSize;
 
 	return node;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 58386d5040..6a5259926d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3633,12 +3633,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
 
 	if (parse->distinctClause &&
 		grouping_is_sortable(parse->distinctClause))
+	{
 		root->distinct_pathkeys =
 			make_pathkeys_for_sortclauses(root,
 										  parse->distinctClause,
 										  tlist);
+		root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+	}
 	else
+	{
 		root->distinct_pathkeys = NIL;
+		root->query_uniquekeys = NIL;
+	}
 
 	root->sort_pathkeys =
 		make_pathkeys_for_sortclauses(root,
@@ -4832,7 +4838,7 @@ create_distinct_paths(PlannerInfo *root,
 		{
 			Path	   *path = (Path *) lfirst(lc);
 
-			if (query_has_uniquekeys_for(root, needed_pathkeys, false))
+			if (query_has_uniquekeys_for(root, path->uniquekeys, false))
 				add_path(distinct_rel, path);
 		}
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d4abb3cb47..79af18e7a7 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3003,6 +3003,44 @@ create_upper_unique_path(PlannerInfo *root,
 	return pathnode;
 }
 
+/*
+ * create_skipscan_unique_path
+ *	  Creates a pathnode the same as an existing IndexPath except based on
+ *	  skipping duplicate values.  This may or may not be cheaper than using
+ *	  create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root, IndexOptInfo *index,
+							Path *basepath, int prefix)
+{
+	IndexPath 	*pathnode = makeNode(IndexPath);
+	int 		numDistinctRows;
+	UniqueKey *ukey;
+
+	Assert(IsA(basepath, IndexPath));
+
+	/* We don't want to modify basepath, so make a copy. */
+	memcpy(pathnode, basepath, sizeof(IndexPath));
+
+	ukey = linitial_node(UniqueKey, root->query_uniquekeys);
+
+	Assert(prefix > 0);
+	pathnode->indexskipprefix = prefix;
+	pathnode->indexdistinct = true;
+	pathnode->path.uniquekeys = root->query_uniquekeys;
+
+	numDistinctRows = estimate_num_groups(root, ukey->exprs,
+										  pathnode->path.rows,
+										  NULL);
+
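+	/*
+	 * Assume each distinct group requires roughly one descent of the index,
+	 * so approximate the total cost as the per-group startup cost times the
+	 * estimated number of distinct groups.
+	 */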
+	pathnode->path.total_cost = pathnode->path.startup_cost * numDistinctRows;
+	pathnode->path.rows = numDistinctRows;
+
+	return pathnode;
+}
+
 /*
  * create_agg_path
  *	  Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 0b2f9d398a..6fb3648d0f 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,9 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanskip = (amroutine->amskip != NULL &&
+					amroutine->amgetskiptuple != NULL &&
+					amroutine->ambeginskipscan != NULL);
 			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6f603cbbe8..b04ef1811d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -957,6 +957,15 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+	{
+		{"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enables the planner's use of index-skip-scan plans."),
+			NULL
+		},
+		&enable_indexskipscan,
+		true,
+		NULL, NULL, NULL
+	},
 	{
 		{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
 			gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5a0b8e9821..947dff83a5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -357,6 +357,7 @@
 #enable_hashjoin = on
 #enable_indexscan = on
 #enable_indexonlyscan = on
+#enable_indexskipscan = on
 #enable_material = on
 #enable_mergejoin = on
 #enable_nestloop = on
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3c49476483..52432cb24d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -991,7 +991,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 
 	state->tupDesc = tupDesc;	/* assume we need not copy tupDesc */
 
-	indexScanKey = _bt_mkscankey(indexRel, NULL);
+	indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
 
 	if (state->indexInfo->ii_Expressions != NULL)
 	{
@@ -1086,7 +1086,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	state->indexRel = indexRel;
 	state->enforceUnique = enforceUnique;
 
-	indexScanKey = _bt_mkscankey(indexRel, NULL);
+	indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
 
 	/* Prepare SortSupport data for each column */
 	state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4325faa460..7de854649a 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -119,6 +119,12 @@ typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 											   int nkeys,
 											   int norderbys);
 
+/* prepare for index scan with skip */
+typedef IndexScanDesc (*ambeginscan_skip_function) (Relation indexRelation,
+											   int nkeys,
+											   int norderbys,
+											   int prefix);
+
 /* (re)start index scan */
 typedef void (*amrescan_function) (IndexScanDesc scan,
 								   ScanKey keys,
@@ -130,6 +136,16 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 									 ScanDirection direction);
 
+/* next valid tuple */
+typedef bool (*amgettuple_with_skip_function) (IndexScanDesc scan,
+											   ScanDirection prefixDir,
+											   ScanDirection postfixDir);
+
+/* skip past duplicates */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+								 ScanDirection prefixDir,
+								 ScanDirection postfixDir);
+
 /* fetch all valid tuples */
 typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
 									   TIDBitmap *tbm);
@@ -225,12 +241,15 @@ typedef struct IndexAmRoutine
 	ambuildphasename_function ambuildphasename; /* can be NULL */
 	amvalidate_function amvalidate;
 	ambeginscan_function ambeginscan;
+	ambeginscan_skip_function ambeginskipscan;
 	amrescan_function amrescan;
 	amgettuple_function amgettuple; /* can be NULL */
+	amgettuple_with_skip_function amgetskiptuple; /* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
 	ammarkpos_function ammarkpos;	/* can be NULL */
 	amrestrpos_function amrestrpos; /* can be NULL */
+	amskip_function amskip;				/* can be NULL */
 
 	/* interface functions to support parallel index scans */
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 931257bd81..2949d457b6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -149,9 +149,17 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
 									 Relation indexRelation,
 									 Snapshot snapshot,
 									 int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_skip(Relation heapRelation,
+									 Relation indexRelation,
+									 Snapshot snapshot,
+									 int nkeys, int norderbys, int prefix);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 											Snapshot snapshot,
 											int nkeys);
+extern IndexScanDesc index_beginscan_bitmap_skip(Relation indexRelation,
+											Snapshot snapshot,
+											int nkeys,
+											int prefix);
 extern void index_rescan(IndexScanDesc scan,
 						 ScanKey keys, int nkeys,
 						 ScanKey orderbys, int norderbys);
@@ -167,10 +175,16 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
 											  ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 									 ScanDirection direction);
+extern ItemPointer index_getnext_tid_skip(IndexScanDesc scan,
+									 ScanDirection prefixDir,
+									 ScanDirection postfixDir);
 struct TupleTableSlot;
 extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
 extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
 							   struct TupleTableSlot *slot);
+extern bool index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir,
+									ScanDirection postfixDir,
+									struct TupleTableSlot *slot);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -180,6 +194,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
 												   IndexBulkDeleteResult *stats);
 extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection prefixDir,
+					   ScanDirection postfixDir);
 extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
 extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f5274cc750..1c5ff49ed6 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -901,6 +901,54 @@ typedef struct BTArrayKeyInfo
 	Datum	   *elem_values;	/* array of num_elems Datums */
 } BTArrayKeyInfo;
 
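+/*
+ * Result of comparing an index tuple against the current skip scan keys
+ * (filled in by _bt_compare_current_item).
+ */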
+typedef struct BTSkipCompareResult
+{
+	bool		equal;
+	int			prefixCmpResult, skCmpResult;
+	bool		prefixSkip, fullKeySkip;
+	int			prefixSkipIndex;
+} BTSkipCompareResult;
+
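+/* Next action for the skip scan state machine (see BTSkipPosData.nextAction) */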
+typedef enum BTSkipState
+{
+	SkipStateStop,
+	SkipStateSkip,
+	SkipStateSkipExtra,
+	SkipStateNext
+} BTSkipState;
+
+typedef struct BTSkipPosData
+{
+	BTSkipState nextAction;
+	ScanDirection nextDirection;
+	int nextSkipIndex;
+	BTScanInsertData skipScanKey;
+	char skipTuple[BLCKSZ]; /* tuple data that the skipScanKey Datums point to */
+} BTSkipPosData;
+
+typedef struct BTSkipData
+{
+	/* Used to control skipping.
+	 * curPos.skipScanKey is a combination of currentTupleKey and fwdScanKey/bwdScanKey:
+	 * currentTupleKey contains the scan keys for the current tuple,
+	 * fwdScanKey contains the scan keys for quals that would be chosen for a forward scan,
+	 * bwdScanKey contains the scan keys for quals that would be chosen for a backward scan.
+	 * We need both fwd and bwd, because the scan keys differ between the two directions:
+	 * given the quals a > 2 AND a < 5, fwd would contain a > 2, while bwd would contain a < 5.
+	 */
+	BTScanInsertData	currentTupleKey;
+	BTScanInsertData	fwdScanKey;
+	ScanKeyData			fwdNotNullKeys[INDEX_MAX_KEYS];
+	BTScanInsertData	bwdScanKey;
+	ScanKeyData			bwdNotNullKeys[INDEX_MAX_KEYS];
+	/* length of prefix to skip */
+	int					prefix;
+
+	BTSkipPosData curPos, markPos;
+} BTSkipData;
+
+typedef BTSkipData *BTSkip;
+
 typedef struct BTScanOpaqueData
 {
 	/* these fields are set by _bt_preprocess_keys(): */
@@ -938,6 +986,9 @@ typedef struct BTScanOpaqueData
 	 */
 	int			markItemIndex;	/* itemIndex, or -1 if not valid */
 
+	/* Work space for _bt_skip */
+	BTSkip	skipData;	/* used to control skipping */
+
 	/* keep these last in struct for efficiency */
 	BTScanPosData currPos;		/* current position data */
 	BTScanPosData markPos;		/* marked position, if any */
@@ -952,6 +1003,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
  */
 #define SK_BT_REQFWD	0x00010000	/* required to continue forward scan */
 #define SK_BT_REQBKWD	0x00020000	/* required to continue backward scan */
+#define SK_BT_REQSKIPFWD	0x00040000	/* required to continue forward scan within current prefix */
+#define SK_BT_REQSKIPBKWD	0x00080000	/* required to continue backward scan within current prefix */
 #define SK_BT_INDOPTION_SHIFT  24	/* must clear the above bits */
 #define SK_BT_DESC			(INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -998,9 +1051,12 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 					 IndexUniqueCheck checkUnique,
 					 struct IndexInfo *indexInfo);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix);
 extern Size btestimateparallelscan(void);
 extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 					 ScanKey orderbys, int norderbys);
@@ -1097,15 +1153,81 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
 							bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
 							   Snapshot snapshot);
+extern Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+						 OffsetNumber *offnum, bool isRegularMode);
+extern bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+extern void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+
+/*
+ * prototypes for functions in nbtskip.c
+ */
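+
+/* Returns whether skipping is enabled for this scan */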
+static inline bool
+_bt_skip_enabled(BTScanOpaque so)
+{
+	return so->skipData != NULL;
+}
+
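+/* regular mode: prefix and postfix are scanned in the same direction */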
+static inline bool
+_bt_skip_is_regular_mode(ScanDirection prefixDir, ScanDirection postfixDir)
+{
+	return prefixDir == postfixDir;
+}
+
+/* returns whether or not we can use extra quals in the scankey after skipping to a prefix */
+static inline bool
+_bt_has_extra_quals_after_skip(BTSkip skip, ScanDirection dir, int prefix)
+{
+	if (ScanDirectionIsForward(dir))
+	{
+		return skip->fwdScanKey.keysz > prefix;
+	}
+	else
+	{
+		return skip->bwdScanKey.keysz > prefix;
+	}
+}
+
+/* alias of BTScanPosIsValid */
+static inline bool
+_bt_skip_is_always_valid(BTScanOpaque so)
+{
+	return BTScanPosIsValid(so->currPos);
+}
+
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_create_scankeys(Relation rel, BTScanOpaque so);
+extern void _bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel,
+					ScanDirection curDir, ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup);
+extern void _bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+						  bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+									  ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp);
+extern bool _bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+							   ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+								 ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_has_results(BTScanOpaque so);
+extern void _bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+									 ScanDirection dir, bool isRegularMode, BTSkipCompareResult* cmp);
+extern bool _bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum);
+extern bool _bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+								  OffsetNumber *curTupleOffnum);
+extern bool _bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+							   ScanDirection dir, bool *continuescan, int *prefixskipindex);
+extern IndexTuple _bt_get_tuple_from_offset(BTScanOpaque so,
+											OffsetNumber curTupleOffnum);
 
 /*
  * prototypes for functions in nbtutils.c
  */
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
 extern void _bt_freestack(BTStack stack);
 extern void _bt_preprocess_array_keys(IndexScanDesc scan);
 extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1114,7 +1236,7 @@ extern void _bt_mark_array_keys(IndexScanDesc scan);
 extern void _bt_restore_array_keys(IndexScanDesc scan);
 extern void _bt_preprocess_keys(IndexScanDesc scan);
 extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
-						  int tupnatts, ScanDirection dir, bool *continuescan);
+						  int tupnatts, ScanDirection dir, bool *continuescan, int *indexSkipPrefix);
 extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1136,6 +1258,19 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
 extern bool _bt_allequalimage(Relation rel, bool debugmessage);
+extern bool _bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+				ScanDirection dir, bool *continuescan, int *prefixSkipIndex);
+extern bool _bt_create_insertion_scan_key(Relation	rel, ScanDirection dir,
+				ScanKey* startKeys, int keysCount,
+				BTScanInsert inskey, StrategyNumber* stratTotal,
+				bool* goback);
+extern void _bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir,
+		bool* nextkey, bool* goback);
+extern int _bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir,
+				ScanKey* startKeys, ScanKeyData* notnullkeys,
+				StrategyNumber* stratTotal, int prefix);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key);
+extern void print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel, char *extra);
 
 /*
  * prototypes for functions in nbtvalidate.c
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c7deeac662..fe7729bf75 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -429,9 +429,13 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
  */
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanSkipMtd) (ScanState *node);
 
 extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 								ExecScanRecheckMtd recheckMtd);
+extern TupleTableSlot *ExecScanExtended(ScanState *node, ExecScanAccessMtd accessMtd,
+								ExecScanRecheckMtd recheckMtd,
+								ExecScanSkipMtd skipMtd);
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, Index varno);
 extern void ExecScanReScan(ScanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6f96b31fb4..03fd532794 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1333,6 +1333,7 @@ typedef struct ScanState
 	Relation	ss_currentRelation;
 	struct TableScanDescData *ss_currentScanDesc;
 	TupleTableSlot *ss_ScanTupleSlot;
+	bool		ss_FirstTupleEmitted;	/* has the first tuple been emitted? */
 } ScanState;
 
 /* ----------------
@@ -1429,6 +1430,8 @@ typedef struct IndexScanState
 	ExprContext *iss_RuntimeContext;
 	Relation	iss_RelationDesc;
 	struct IndexScanDescData *iss_ScanDesc;
+	int			iss_SkipPrefixSize; /* size of the prefix for skip scans */
+	bool		iss_Distinct;	/* return only distinct prefix values? */
 
 	/* These are needed for re-checking ORDER BY expr ordering */
 	pairingheap *iss_ReorderQueue;
@@ -1458,6 +1461,8 @@ typedef struct IndexScanState
  *		TableSlot		   slot for holding tuples fetched from the table
  *		VMBuffer		   buffer in use for visibility map testing, if any
  *		PscanLen		   size of parallel index-only scan descriptor
+ *		SkipPrefixSize	   number of keys for skip-based DISTINCT
+ *		FirstTupleEmitted  has the first tuple been emitted
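+ *		Distinct		   whether only distinct prefix values are requested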
  * ----------------
  */
 typedef struct IndexOnlyScanState
@@ -1476,6 +1481,8 @@ typedef struct IndexOnlyScanState
 	struct IndexScanDescData *ioss_ScanDesc;
 	TupleTableSlot *ioss_TableSlot;
 	Buffer		ioss_VMBuffer;
+	int			ioss_SkipPrefixSize;
+	bool		ioss_Distinct;
 	Size		ioss_PscanLen;
 } IndexOnlyScanState;
 
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index a5c406bd4e..e5101328b7 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1229,6 +1229,9 @@ typedef struct Path
  * we need not recompute them when considering using the same index in a
  * bitmap index/heap scan (see BitmapHeapPath).  The costs of the IndexPath
  * itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
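+ *
+ * 'indexdistinct' is set when the scan should return only distinct prefix
+ * values (skip-based DISTINCT).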
  *----------
  */
 typedef struct IndexPath
@@ -1241,6 +1244,8 @@ typedef struct IndexPath
 	ScanDirection indexscandir;
 	Cost		indextotalcost;
 	Selectivity indexselectivity;
+	int			indexskipprefix;
+	bool		indexdistinct;
 } IndexPath;
 
 /*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 83e01074ed..f209b7963b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -409,6 +409,8 @@ typedef struct IndexScan
 	List	   *indexorderbyorig;	/* the same in original form */
 	List	   *indexorderbyops;	/* OIDs of sort ops for ORDER BY exprs */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	int			indexskipprefixsize;	/* the size of the prefix for skip scans */
+	bool		indexdistinct; /* whether only distinct keys are requested */
 } IndexScan;
 
 /* ----------------
@@ -436,6 +438,8 @@ typedef struct IndexOnlyScan
 	List	   *indexorderby;	/* list of index ORDER BY exprs */
 	List	   *indextlist;		/* TargetEntry list describing index's cols */
 	ScanDirection indexorderdir;	/* forward or backward or don't care */
+	int			indexskipprefixsize;	/* the size of the prefix for skip scans */
+	bool		indexdistinct; /* whether only distinct keys are requested */
 } IndexOnlyScan;
 
 /* ----------------
@@ -462,6 +466,7 @@ typedef struct BitmapIndexScan
 	bool		isshared;		/* Create shared bitmap if set */
 	List	   *indexqual;		/* list of index quals (OpExprs) */
 	List	   *indexqualorig;	/* the same in original form */
+	int			indexskipprefixsize;	/* the size of the prefix for skip scans */
 } BitmapIndexScan;
 
 /* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 613db8eab6..c0f176eaaa 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
 extern PGDLLIMPORT bool enable_seqscan;
 extern PGDLLIMPORT bool enable_indexscan;
 extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
 extern PGDLLIMPORT bool enable_bitmapscan;
 extern PGDLLIMPORT bool enable_tidscan;
 extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 6796ad8cb7..8ec1780a56 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -207,6 +207,10 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
 												 Path *subpath,
 												 int numCols,
 												 double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+											  IndexOptInfo *index,
+											  Path *subpath,
+											  int prefix);
 extern AggPath *create_agg_path(PlannerInfo *root,
 								RelOptInfo *rel,
 								Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0cb8030e33..f934f0011a 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -198,6 +198,10 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
 													   Relids required_outer,
 													   double fraction);
 extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
+extern int find_index_prefix_for_pathkey(PlannerInfo *root,
+					 IndexOptInfo *index,
+					 ScanDirection scandir,
+					 PathKey *pathkey);
 extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
 								  ScanDirection scandir);
 extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 11c6f50fbf..bdcf75fe8e 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -306,3 +306,602 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
  t
 (1 row)
 
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+    SELECT five, tenthous, 10 FROM
+    generate_series(1, 5) five,
+    generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a 
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a 
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+                       QUERY PLAN                       
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+(2 rows)
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b 
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b 
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (b = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+                     QUERY PLAN                     
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (b = 2)
+(3 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b 
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b 
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b 
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b 
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a |   b   
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b 
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a |   b   
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a |   b   
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+	VALUES (1, 1, 1),
+		   (1, 1, 2),
+		   (1, 2, 2),
+		   (1, 2, 3),
+		   (2, 2, 1),
+		   (2, 2, 3),
+		   (3, 1, 1),
+		   (3, 1, 2),
+		   (3, 2, 2),
+		   (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+                          QUERY PLAN                          
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+   Skip scan: Distinct only
+   Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c 
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c 
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+                              QUERY PLAN                               
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+   Skip scan: Distinct only
+   Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c 
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c 
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c  
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c  
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (a = 1)
+(3 rows)
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+                    QUERY PLAN                     
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+   Skip scan: Distinct only
+   Index Cond: (b = 2)
+   Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column? 
+---+----------
+ 1 |        1
+ 2 |        1
+ 3 |        1
+ 4 |        1
+ 5 |        1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a 
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a 
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a 
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a 
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a 
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b 
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a |   b   
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 |  9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+    SELECT a, b::int2 b, (b % 2)::int2 c FROM
+        generate_series(1, 5) a,
+        generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+                                 QUERY PLAN                                 
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+   Skip scan: Distinct only
+   Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c 
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
+-- test tuple killing
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(4 rows)
+
+    FETCH BACKWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- regular ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a, b;
+    FETCH FORWARD ALL FROM c;
+ a | b | c | d  
+---+---+---+----
+ 1 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 5 | 1 | 1 | 10
+(4 rows)
+
+    FETCH BACKWARD ALL FROM c;
+ a | b | c | d  
+---+---+---+----
+ 5 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 1 | 1 | 1 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- partial delete
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(5 rows)
+
+    FETCH BACKWARD ALL FROM c;
+ a |  b   | c | d  
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(5 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 06c4c3e476..e64e20a8cb 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -79,6 +79,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_incremental_sort        | on
  enable_indexonlyscan           | on
  enable_indexscan               | on
+ enable_indexskipscan           | on
  enable_material                | on
  enable_mergejoin               | on
  enable_nestloop                | on
@@ -90,7 +91,7 @@ select name, setting from pg_settings where name like 'enable%';
  enable_seqscan                 | on
  enable_sort                    | on
  enable_tidscan                 | on
-(18 rows)
+(19 rows)
 
 -- Test that the pg_timezone_names and pg_timezone_abbrevs views are
 -- more-or-less working.  We can't test their contents in any great detail
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index 33102744eb..0227c98823 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -135,3 +135,251 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
 SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
 SELECT 2 IS NOT DISTINCT FROM null as "no";
 SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+    SELECT five, tenthous, 10 FROM
+    generate_series(1, 5) five,
+    generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+	VALUES (1, 1, 1),
+		   (1, 1, 2),
+		   (1, 2, 2),
+		   (1, 2, 3),
+		   (2, 2, 1),
+		   (2, 2, 3),
+		   (3, 1, 1),
+		   (3, 1, 2),
+		   (3, 2, 2),
+		   (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuple visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+    SELECT a, b::int2 b, (b % 2)::int2 c FROM
+        generate_series(1, 5) a,
+        generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
+
+-- test tuple killing
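+-- (the forward cursor pass marks index entries of deleted rows as killed;
+--  the backward pass then checks that skipping still returns correct results)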
+
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+    FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- regular ordering
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a, b;
+    FETCH FORWARD ALL FROM c;
+    FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- partial delete
+CREATE TABLE distinct_killed AS
+    SELECT a, b, b % 2 AS c, 10 AS d
+        FROM generate_series(1, 5) a,
+             generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+
+BEGIN;
+    DECLARE c SCROLL CURSOR FOR
+    SELECT DISTINCT ON (a) a,b,c,d
+    FROM distinct_killed ORDER BY a DESC, b DESC;
+    FETCH FORWARD ALL FROM c;
+    FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
-- 
2.27.0

v02-0003-Support-skip-scan-for-non-distinct-scans.patchapplication/octet-stream; name=v02-0003-Support-skip-scan-for-non-distinct-scans.patchDownload
From a94ac1a4ae61aca08246417d54d2e6ad3a22487c Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Thu, 19 Mar 2020 10:27:47 +0100
Subject: [PATCH 5/5] Support skip scan for non-distinct scans

Adds planner support to choose a skip scan for regular
non-distinct queries like:
SELECT * FROM t1 WHERE b=1 (with index on (a,b))
---
 src/backend/optimizer/path/indxpath.c | 181 +++++++++++++++++++++++++-
 src/backend/optimizer/plan/planner.c  |   2 +-
 src/backend/optimizer/util/pathnode.c |   4 +-
 src/backend/utils/adt/selfuncs.c      | 153 ++++++++++++++++++++--
 src/include/optimizer/pathnode.h      |   3 +-
 5 files changed, 327 insertions(+), 16 deletions(-)

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 216dd4611e..9bf7bd73f1 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -188,6 +188,17 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
 static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
 									   EquivalenceClass *ec, EquivalenceMember *em,
 									   void *arg);
+static List *add_possible_index_skip_paths(List *result,
+										  PlannerInfo *root,
+										  IndexOptInfo *index,
+										  List *indexclauses,
+										  List *indexorderbys,
+										  List *indexorderbycols,
+										  List *pathkeys,
+										  ScanDirection indexscandir,
+										  bool indexonly,
+										  Relids required_outer,
+										  double loop_count);
 
 
 /*
@@ -816,6 +827,136 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	}
 }
 
+/*
+ * Find available index skip paths and add them to the path list
+ */
+static List *add_possible_index_skip_paths(List *result,
+										  PlannerInfo *root,
+										  IndexOptInfo *index,
+										  List *indexclauses,
+										  List *indexorderbys,
+										  List *indexorderbycols,
+										  List *pathkeys,
+										  ScanDirection indexscandir,
+										  bool indexonly,
+										  Relids required_outer,
+										  double loop_count)
+{
+	int			indexcol;
+	bool		eqQualHere;
+	bool		eqQualPrev;
+	bool		eqSoFar;
+	ListCell   *lc;
+
+	/*
+	 * We need to find possible prefixes to use for the skip scan.
+	 * Any useful prefix is one just before an index clause, unless
+	 * all clauses so far have been equality quals.
+	 * For example, on an index (a,b,c), the qual b=1 would
+	 * mean that an interesting skip prefix could be 1.
+	 * For qual a=1 AND b=1, it is not interesting to skip with
+	 * prefix 1, because the value of a is fixed already.
+	 */
+	indexcol = 0;
+	eqQualHere = false;
+	eqQualPrev = false;
+	eqSoFar = true;
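+
+	/*
+	 * eqQualHere tracks whether the current index column has an equality
+	 * qual, eqQualPrev whether the previous column had one, and eqSoFar
+	 * whether every column seen so far was constrained by equality quals.
+	 */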
+	foreach(lc, indexclauses)
+	{
+		IndexClause *iclause = lfirst_node(IndexClause, lc);
+		ListCell   *lc2;
+
+		if (indexcol != iclause->indexcol)
+		{
+			if (!eqQualHere)
+				eqSoFar = false;
+
+			/* Beginning of a new column's quals */
+			if (!eqQualPrev && !eqSoFar)
+			{
+				/*
+				 * We have a qual on the current column, there is no
+				 * equality qual on the previous column, and not all of
+				 * the preceding quals are equality quals (the last
+				 * condition is a special case for the first column in
+				 * the index).  These are optimal conditions to try an
+				 * index skip path.
+				 */
+				IndexPath *ipath = create_index_path(root, index,
+										  indexclauses,
+										  indexorderbys,
+										  indexorderbycols,
+										  pathkeys,
+										  indexscandir,
+										  indexonly,
+										  required_outer,
+										  loop_count,
+										  false,
+										  iclause->indexcol);
+				result = lappend(result, ipath);
+			}
+
+			eqQualPrev = eqQualHere;
+			eqQualHere = false;
+			indexcol++;
+			/* if the clause is not for this index col, increment until it is */
+			while (indexcol != iclause->indexcol)
+			{
+				eqQualPrev = false;
+				eqSoFar = false;
+				indexcol++;
+			}
+		}
+
+		/* Examine each indexqual associated with this index clause */
+		foreach(lc2, iclause->indexquals)
+		{
+			RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc2);
+			Expr	   *clause = rinfo->clause;
+			Oid			clause_op = InvalidOid;
+			int			op_strategy;
+
+			if (IsA(clause, OpExpr))
+			{
+				OpExpr	   *op = (OpExpr *) clause;
+				clause_op = op->opno;
+			}
+			else if (IsA(clause, RowCompareExpr))
+			{
+				RowCompareExpr *rc = (RowCompareExpr *) clause;
+				clause_op = linitial_oid(rc->opnos);
+			}
+			else if (IsA(clause, ScalarArrayOpExpr))
+			{
+				ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
+				clause_op = saop->opno;
+			}
+			else if (IsA(clause, NullTest))
+			{
+				NullTest   *nt = (NullTest *) clause;
+
+				if (nt->nulltesttype == IS_NULL)
+				{
+					/* IS NULL is like = for selectivity purposes */
+					eqQualHere = true;
+				}
+			}
+			else
+				elog(ERROR, "unsupported indexqual type: %d",
+					 (int) nodeTag(clause));
+
+			/* check for equality operator */
+			if (OidIsValid(clause_op))
+			{
+				op_strategy = get_op_opfamily_strategy(clause_op,
+													   index->opfamily[indexcol]);
+				Assert(op_strategy != 0);	/* not a member of opfamily?? */
+				if (op_strategy == BTEqualStrategyNumber)
+					eqQualHere = true;
+			}
+		}
+	}
+	return result;
+}
+
 /*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
@@ -1051,9 +1192,25 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  index_only_scan,
 								  outer_relids,
 								  loop_count,
-								  false);
+								  false,
+								  0);
 		result = lappend(result, ipath);
 
+		if (can_skip)
+		{
+			result = add_possible_index_skip_paths(result, root, index,
+												   index_clauses,
+												   orderbyclauses,
+												   orderbyclausecols,
+												   useful_pathkeys,
+												   index_is_ordered ?
+												   ForwardScanDirection :
+												   NoMovementScanDirection,
+												   index_only_scan,
+												   outer_relids,
+												   loop_count);
+		}
+
 		/* Consider index skip scan as well */
 		if (root->query_uniquekeys != NULL && can_skip)
 		{
@@ -1100,7 +1257,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  index_only_scan,
 									  outer_relids,
 									  loop_count,
-									  true);
+									  true,
+									  0);
 
 			/*
 			 * if, after costing the path, we find that it's not worth using
@@ -1133,9 +1291,23 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  index_only_scan,
 									  outer_relids,
 									  loop_count,
-									  false);
+									  false,
+									  0);
 			result = lappend(result, ipath);
 
+			if (can_skip)
+			{
+				result = add_possible_index_skip_paths(result, root, index,
+													   index_clauses,
+													   NIL,
+													   NIL,
+													   useful_pathkeys,
+													   BackwardScanDirection,
+													   index_only_scan,
+													   outer_relids,
+													   loop_count);
+			}
+
 			/* Consider index skip scan as well */
 			if (root->query_uniquekeys != NULL && can_skip)
 			{
@@ -1177,7 +1349,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 										  index_only_scan,
 										  outer_relids,
 										  loop_count,
-										  true);
+										  true,
+										  0);
 
 				/*
 				 * if, after costing the path, we find that it's not worth
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6a5259926d..da2bdcd8d3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6356,7 +6356,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0, false);
+									  NULL, 1.0, false, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 79af18e7a7..11d3b71ea5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  bool indexonly,
 				  Relids required_outer,
 				  double loop_count,
-				  bool partial_path)
+				  bool partial_path,
+				  int skip_prefix)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1039,6 +1040,7 @@ create_index_path(PlannerInfo *root,
 	pathnode->indexorderbys = indexorderbys;
 	pathnode->indexorderbycols = indexorderbycols;
 	pathnode->indexscandir = indexscandir;
+	pathnode->indexskipprefix = skip_prefix;
 
 	cost_index(pathnode, root, loop_count, partial_path);
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 53d974125f..0f66eab42d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -210,7 +210,9 @@ static bool get_actual_variable_endpoint(Relation heapRel,
 										 MemoryContext outercontext,
 										 Datum *endpointDatum);
 static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
-
+static double estimate_num_groups_internal(PlannerInfo *root, List *groupExprs,
+									double input_rows, double rel_input_rows,
+									List **pgset);
 
 /*
  *		eqsel			- Selectivity of "=" for any data types.
@@ -3359,6 +3361,19 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
 double
 estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 					List **pgset)
+{
+	return estimate_num_groups_internal(root, groupExprs, input_rows, -1, pgset);
+}
+
+/*
+ * Same as estimate_num_groups, but with an extra argument to control
+ * the estimation used for the input rows of the relation. If
+ * rel_input_rows < 0, it uses the original planner estimation for the
+ * individual rels; otherwise it uses the estimation provided to the function.
+ */
+static double
+estimate_num_groups_internal(PlannerInfo *root, List *groupExprs,
+							 double input_rows, double rel_input_rows,
+							 List **pgset)
 {
 	List	   *varinfos = NIL;
 	double		srf_multiplier = 1.0;
@@ -3513,6 +3528,12 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 		int			relvarcount = 0;
 		List	   *newvarinfos = NIL;
 		List	   *relvarinfos = NIL;
+		double		this_rel_input_rows;
+
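+		/* Use the caller-supplied input row estimate for this rel, if any */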
+		if (rel_input_rows < 0.0)
+			this_rel_input_rows = rel->rows;
+		else
+			this_rel_input_rows = rel_input_rows;
 
 		/*
 		 * Split the list of varinfos in two - one for the current rel, one
@@ -3610,7 +3631,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 			 * guarding against division by zero when reldistinct is zero.
 			 * Also skip this if we know that we are returning all rows.
 			 */
-			if (reldistinct > 0 && rel->rows < rel->tuples)
+			if (reldistinct > 0 && this_rel_input_rows < rel->tuples)
 			{
 				/*
 				 * Given a table containing N rows with n distinct values in a
@@ -3647,7 +3668,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
 				 * works well even when n is small.
 				 */
 				reldistinct *=
-					(1 - pow((rel->tuples - rel->rows) / rel->tuples,
+					(1 - pow((rel->tuples - this_rel_input_rows) / rel->tuples,
 							 rel->tuples / reldistinct));
 			}
 			reldistinct = clamp_row_est(reldistinct);
@@ -6247,8 +6268,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	double		numIndexTuples;
 	Cost		descentCost;
 	List	   *indexBoundQuals;
+	List	   *prefixBoundQuals;
 	int			indexcol;
 	bool		eqQualHere;
+	bool		stillEq;
 	bool		found_saop;
 	bool		found_is_null_op;
 	double		num_sa_scans;
@@ -6272,9 +6295,11 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	 * considered to act the same as it normally does.
 	 */
 	indexBoundQuals = NIL;
+	prefixBoundQuals = NIL;
 	indexcol = 0;
 	eqQualHere = false;
 	found_saop = false;
+	stillEq = true;
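+	/* stillEq stays true until we see a column without an equality qual */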
 	found_is_null_op = false;
 	num_sa_scans = 1;
 	foreach(lc, path->indexclauses)
@@ -6286,11 +6311,18 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 		{
 			/* Beginning of a new column's quals */
 			if (!eqQualHere)
-				break;			/* done if no '=' qual for indexcol */
+			{
+				stillEq = false;
+				/* done if no '=' qual for indexcol and we're past the skip prefix */
+				if (path->indexskipprefix <= indexcol)
+					break;
+			}
 			eqQualHere = false;
 			indexcol++;
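+			/* columns within the skip prefix may lack quals; advance past them */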
+			while (indexcol != iclause->indexcol && path->indexskipprefix > indexcol)
+				indexcol++;
 			if (indexcol != iclause->indexcol)
-				break;			/* no quals at all for indexcol */
+				break; /* no quals at all for indexcol */
 		}
 
 		/* Examine each indexqual associated with this index clause */
@@ -6322,7 +6354,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				clause_op = saop->opno;
 				found_saop = true;
 				/* count number of SA scans induced by indexBoundQuals only */
-				if (alength > 1)
+				if (alength > 1 && stillEq)
 					num_sa_scans *= alength;
 			}
 			else if (IsA(clause, NullTest))
@@ -6350,7 +6382,14 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 					eqQualHere = true;
 			}
 
-			indexBoundQuals = lappend(indexBoundQuals, rinfo);
+			/*
+			 * We keep two lists here: one with only the quals up to the
+			 * first inequality, and one with all the quals up to the skip
+			 * prefix.  The prefix list is needed later on.
+			 */
+			if (stillEq)
+				indexBoundQuals = lappend(indexBoundQuals, rinfo);
+			if (path->indexskipprefix > 0)
+				prefixBoundQuals = lappend(prefixBoundQuals, rinfo);
 		}
 	}
 
@@ -6376,7 +6415,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 		 * index-bound quals to produce a more accurate idea of the number of
 		 * rows covered by the bound conditions.
 		 */
-		selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+		if (path->indexskipprefix > 0)
+			selectivityQuals = add_predicate_to_index_quals(index, prefixBoundQuals);
+		else
+			selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
 
 		btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
 												  index->rel->relid,
@@ -6386,7 +6428,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 
 		/*
 		 * As in genericcostestimate(), we have to adjust for any
-		 * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
+		 * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
 		 * to integer.
 		 */
 		numIndexTuples = rint(numIndexTuples / num_sa_scans);
@@ -6432,6 +6474,99 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	costs.indexStartupCost += descentCost;
 	costs.indexTotalCost += costs.num_sa_scans * descentCost;
 
+	/*
+	 * Add extra costs for using an index skip scan.
+	 * The index skip scan could have a significantly lower cost up to this
+	 * point, due to the different row estimation used (all the quals up to
+	 * the prefix, rather than only the quals up to the first non-equality
+	 * operator).  However, extra costs are incurred for:
+	 * a) setting up the scan
+	 * b) doing additional scans from the root
+	 * c) a small extra cost per tuple comparison
+	 * We add those here.
+	 */
+	if (path->indexskipprefix > 0)
+	{
+		List *exprlist = NULL;
+		double numgroups_estimate;
+		int i = 0;
+		ListCell *indexpr_item = list_head(path->indexinfo->indexprs);
+		List	   *selectivityQuals;
+		Selectivity btreeSelectivity;
+		double estimatedIndexTuplesNoPrefix;
+
+		/* some rather arbitrary extra cost for preprocessing structures needed for skip scan */
+		costs.indexStartupCost += 200.0 * cpu_operator_cost;
+		costs.indexTotalCost += 200.0 * cpu_operator_cost;
+
+		/*
+		 * To reliably estimate the cost of the scans we have to do from the
+		 * root, we need an estimate of the number of distinct prefixes that
+		 * exist.  That requires a different selectivity approximation (this
+		 * time we do need to use the clauses up to the first non-equality
+		 * operator).  Using that, we can estimate the number of groups.
+		 */
+		for (i = 0; i < path->indexinfo->nkeycolumns && i < path->indexskipprefix; i++)
+		{
+			Expr *expr = NULL;
+			int attr = path->indexinfo->indexkeys[i];
+			if(attr > 0)
+			{
+				TargetEntry *tentry = get_tle_by_resno(path->indexinfo->indextlist, i + 1);
+				Assert(tentry != NULL);
+				expr = tentry->expr;
+			}
+			else if (attr == 0)
+			{
+				/* Expression index */
+				expr = lfirst(indexpr_item);
+				indexpr_item = lnext(path->indexinfo->indexprs, indexpr_item);
+			}
+			else /* attr < 0 */
+			{
+				/* Index on system column is not supported */
+				Assert(false);
+			}
+
+			exprlist = lappend(exprlist, expr);
+		}
+
+		selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+
+		btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
+												  index->rel->relid,
+												  JOIN_INNER,
+												  NULL);
+		estimatedIndexTuplesNoPrefix = btreeSelectivity * index->rel->tuples;
+
+		/*
+		 * As in genericcostestimate(), we have to adjust for any
+		 * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
+		 * to integer.
+		 */
+		estimatedIndexTuplesNoPrefix = rint(estimatedIndexTuplesNoPrefix / num_sa_scans);
+
+		numgroups_estimate = estimate_num_groups_internal(
+					root, exprlist, estimatedIndexTuplesNoPrefix,
+					estimatedIndexTuplesNoPrefix, NULL);
+
+		/*
+		 * For each distinct prefix value we add a descent cost, similar to
+		 * the startup cost calculation for regular index scans.
+		 * We can do at most 2 scans from the root per distinct prefix, so multiply by 2.
+		 * Also add some CPU processing cost per page that we need to process, plus
+		 * some additional one-time cost for scanning the leaf page. This is a more
+		 * expensive estimation than the per-page cpu cost for the regular index scan.
+		 * This is intentional, because the index skip scan does more processing on
+		 * the leaf page.
+		 */
+		if (index->tuples > 0)
+			descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost * 2;
+		else
+			descentCost = 0;
+		descentCost += (index->tree_height + 1) * 50.0 * cpu_operator_cost * 2 + 200 * cpu_operator_cost;
+		costs.indexTotalCost += costs.num_sa_scans * descentCost * numgroups_estimate;
+	}
+
 	/*
 	 * If we can get an estimate of the first column's ordering correlation C
 	 * from pg_statistic, estimate the index correlation as C for a
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 8ec1780a56..4fa9adbfee 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,7 +49,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 									bool indexonly,
 									Relids required_outer,
 									double loop_count,
-									bool partial_path);
+									bool partial_path,
+									int skip_prefix);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 											   RelOptInfo *rel,
 											   Path *bitmapqual,
-- 
2.27.0