MDAM techniques and Index Skip Scan patch
I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.
I don't necessarily expect to come away from this discussion with a
much better high level architecture for the patch, or any kind of
deeper insight, or even a frame of reference for further discussion. I
just think that we ought to *try* to impose some order on this stuff.
Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things. Many other MDAM techniques also seem highly appealing.
Much of the MDAM stuff is for data warehousing use-cases, while skip
scan/loose index scan is seen as more of an OLTP thing. But they are
still related, clearly.
I'd also like to talk about another patch that ISTM had that same
quality -- it was also held back by high level design uncertainty.
Back in 2018, Tom abandoned a patch that transformed a star-schema
style query with left outer joins on dimension tables and OR
conditions into an equivalent query that UNIONs together 2 distinct
queries [3][4].
Believe it or not, I am now reminded of that patch by the example of
"IN() Lists", from page 5 of the paper. We see this example SQL query:
SELECT date, item_class, store, sum(total_sales)
FROM sales
WHERE date between '06/01/95' and '06/30/95' and
item_class IN (20,35,50) and
store IN (200,250)
GROUP BY dept, date, item_class, store;
Granted, this SQL might not seem directly relevant to Tom's patch at
first -- there is no join for the optimizer to even try to eliminate,
which was the whole basis of Jim Nasby's original complaint, the
complaint that spurred Tom to write the patch in the first place. But
hear me out: there is still a fact table (the sales table) with some
out: there is still a fact table (the sales table) with some
dimensions (the 'D' from 'MDAM') shown in the predicate. Moreover, the
table (and this SQL query) drives discussion of an optimization
involving transforming a predicate with many ORs (which is explicitly
said to be logically/semantically equivalent to the IN() lists from
the query). They transform the query into a bunch of disjunct clauses
that can easily be independently executed, and combined at the end
(see also "General OR Optimization" on page 6 of the paper).
Also...I'm not entirely sure that the intended underlying "physical
plan" is truly free of join-like scans. If you squint just right, you
might see something that you could think of as a "physical join" (at
least very informally). The whole point of this particular "IN()
Lists" example is that we get to the following, for each distinct
"dept" and "date" in the table:
dept=1, date='06/04/95', item_class=20, store=200
dept=1, date='06/04/95', item_class=20, store=250
dept=1, date='06/04/95', item_class=35, store=200
dept=1, date='06/04/95', item_class=35, store=250
dept=1, date='06/04/95', item_class=50, store=200
dept=1, date='06/04/95', item_class=50, store=250
There are 2400 such accesses in total after transformation -- imagine
additional lines like these for every distinct combination of dept and
date (only for those dates that actually had sales, which they
enumerate up-front), crossed with stores 200 and 250 and item_classes
20, 35, and 50. Even 2400 index probes will be much faster than a full
table scan, given that this is a large fact table. The "sales" table
is a clustered index whose keys are the columns "(dept, date,
item_class, store)", per the note at the top of page 4. The whole
point is to avoid having any secondary indexes on this fact table
without paying for a full scan: following this transformation, we can
just probe the primary key 2400 times instead.
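Each of those accesses is, in effect, nothing more than a four-column
equality lookup against the primary key. Written out as a standalone
query (again my own illustration, reusing the paper's column values),
one of the 2400 probes looks like this:

SELECT total_sales
FROM sales
WHERE dept = 1 and date = '06/04/95' and item_class = 20 and store = 200;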
The plan can be thought of as a DAG, at least informally. This is also
somewhat similar to what Tom was thinking about back in 2018. Tom had
to deduplicate rows during execution (IIRC using a UNION style ad-hoc
approach that sorted on TIDs), whereas I think that they can get away
with skipping that extra step. Page 7 says "MDAM removes duplicates
before reading the data, so it does not have to do any post read
operations to accomplish duplicate elimination (a common problem with
OR optimization)".
My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.
Note that the MDAM paper treats skipping a prefix of columns as a case
where the prefix is handled by pretending that there is a clause that
looks like this: "WHERE date between -inf AND +inf" -- which is not so
different from the original sales SQL query example that I have
highlighted. We don't tend to think of queries like this (like my
sales query) as in any way related to skip-scan, because we don't
imagine that there is any skipping going on. But maybe we should
recognize the similarities.
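A hand-written illustration of what I mean (mine, not taken from the
paper or the patch), using the same sales primary key on (dept, date,
item_class, store):

-- A query with no predicates at all on the leading "dept" and "date"
-- columns, such as
SELECT item_class, store, sum(total_sales)
FROM sales
WHERE item_class = 20 and store = 200
GROUP BY item_class, store;
-- can be treated, for access path purposes, as if it had been written with
--   WHERE dept between -inf and +inf    -- imaginary clause for skipped column
--     AND date between -inf and +inf    -- imaginary clause for skipped column
--     AND item_class = 20 and store = 200
-- which makes it structurally the same as the original BETWEEN example,
-- just with imaginary range endpoints.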
BTW, these imaginary -inf/+inf values seem to me to be just like the
sentinel values already used inside nbtree, for pivot tuples -- they
have explicit -inf values for truncated suffix key columns, and you
can think of a rightmost page as having a +inf high key, per the L&Y
paper. Wearing my B-Tree hat, I don't see much difference between
imaginary -inf/+inf values, and values from the BETWEEN "date" range
from the example SQL query. I have in the past wondered if
_bt_get_endpoint() should have been implemented that way -- we could
go through _bt_search() instead, and get rid of that code. All we need
is insertion scan keys that can explicitly contain the same -inf/+inf
sentinel values. Maybe this also allows us to get rid of
BTScanInsertData.nextkey semantics (not sure offhand).
Another more concrete concern about the patch series comes from the
backwards scan stuff. This is added by a later patch in the patch
series, "v39-0004-Extend-amskip-implementation-for-Btree.patch". It
strikes me as a bad thing that we cannot just do leaf-page-at-a-time
processing, without usually needing to hold a pin on the leaf page.
After all, ordinary backwards scans manage to avoid that today, albeit
by using trickery inside _bt_walk_left(). MDAM-style "Maintenance of
Index Order" (as described on page 8) seems like a good goal for us
here. I don't like the idea of doing ad-hoc duplicate TID elimination
inside nbtree, across calls made from the executor (whether it's
during backwards skip scans, or at any other time). Not because it
seems to go against the approach taken by the MDAM paper (though it
does); just because it seems kludgy. (I think that Tom felt the same
way about the TID deduplication stuff in his own patch back in 2018,
too.)
Open question: What does all of this MDAM business mean for
ScalarArrayOpExpr, if anything?
I freely admit that I could easily be worrying over nothing here. But
if I am, I'd really like to know *why* that's the case.
[1]: http://vldb.org/conf/1995/P710.PDF
[2]: https://blog.timescale.com/blog/how-we-made-distinct-queries-up-to-8000x-faster-on-postgresql/
[3]: /messages/by-id/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489@BlueTreble.com
[4]: /messages/by-id/14593.1517581614@sss.pgh.pa.us
--
Peter Geoghegan
Great to see some interest in the skip scan patch series again!
> Like many difficult patches, the skip scan patch is not so much
> troubled by problems with the implementation as it is troubled by
> *ambiguity* about the design. Particularly concerning how skip scan
> meshes with existing designs, as well as future designs --
> particularly designs for other MDAM techniques. I've started this
> thread to have a big picture conversation about how to think about
> these things. Many other MDAM techniques also seem highly appealing.
I think it is good to have this discussion. In my opinion, Postgres could make really good use of some of the described MDAM techniques.
> Much of the MDAM stuff is for data warehousing use-cases, while skip
> scan/loose index scan is seen as more of an OLTP thing. But they are
> still related, clearly.
FWIW I think skip scan is very much data warehousing use-case related - which is why the TimescaleDB people in your [2] reference already implemented a simple form of it for their extension. Skip scan is a really useful feature for large data sets. However, I agree it is only one part of the bigger MDAM picture.
> My general concern is that the skip scan patch may currently be
> structured in a way that paints us into a corner, MDAM-wise.
One of the concerns I raised before was that the patch may be thinking too simplistically about some things, which would make it difficult to adopt more complex optimizations in the future. One concrete example can be illustrated by a different query on the sales table of the paper's example:
SELECT DISTINCT dept, date FROM sales WHERE item_class = 100
This should skip with a prefix of (dept, date). Suppose we're at (dept, date) = (1, 2021-01-01) and we're skipping to the next prefix. The current patch implements only what the MDAM paper describes as the 'probing' step: it finds the beginning of the next prefix, say (dept, date, item_class) = (1, 2021-01-02, 1), and from there onwards it just scans the index until it finds item_class=100. What it should do instead is first 'probe' for the next prefix value and then skip directly to (1, 2021-01-02, 100), skipping item_class 1-99 altogether. Without support for this, skip scan can have quite unpredictable performance, because sometimes it ends up scanning through most of the index where it should be skipping.
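To illustrate the difference with a small sketch of my own (assuming the paper's index on (dept, date, item_class, store) and that we are currently positioned at (1, 2021-01-01, 100)):

-- current patch: probe only for the start of the next prefix, then scan forward:
--   probe -> (1, 2021-01-02, 1)                          -- start of next (dept, date) prefix
--   scan  -> (1, 2021-01-02, 2) ... (1, 2021-01-02, 99)  -- wasted work
--   match -> (1, 2021-01-02, 100)
--
-- desired: once the prefix probe has found (1, 2021-01-02), skip again within
-- that prefix, which is conceptually the same as a single index probe like:
SELECT dept, date, item_class
FROM sales
WHERE (dept, date, item_class) >= (1, '2021-01-02', 100)
ORDER BY dept, date, item_class
LIMIT 1;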
A while ago, I spent quite some time trying to come up with an implementation that works in this more general case. The nice thing is that with such a more generic implementation, you get almost all the features from the MDAM paper almost for free. I focused on the executor code rather than the planner code - the planner code for the DISTINCT skip part is very similar to the original patch, and I hacked in a way to make the planner choose a 'skip scan' for non-DISTINCT queries too, for testing purposes. For this discussion about MDAM, the planner part is less relevant though. There's still a lot of discussion and work to be had on the planner side as well, but I think the two can be treated completely independently.
I originally posted the more generic patch in [1], together with some technical considerations. That was quite a while ago, so it obviously doesn't apply anymore on master. Therefore, I've attached a rebased version. Unfortunately, it's still based on an older version of the UniqueKeys patch - but since that patch is all planner machinery as well, it doesn't matter so much for the discussion about the executor code either.
I believe that if we want something that fits better with future MDAM use cases, we should take a closer look at the executor code of this patch to drive this discussion. The logic is definitely more complex than in the original patch, but I believe it is also more flexible and more future-proof.
> Another more concrete concern about the patch series comes from the
> backwards scan stuff. This is added by a later patch in the patch
> series, "v39-0004-Extend-amskip-implementation-for-Btree.patch". It
> strikes me as a bad thing that we cannot just do leaf-page-at-a-time
> processing, without usually needing to hold a pin on the leaf page.
> After all, ordinary backwards scans manage to avoid that today, albeit
> by using trickery inside _bt_walk_left(). MDAM-style "Maintenance of
> Index Order" (as described on page 8) seems like a good goal for us
> here. I don't like the idea of doing ad-hoc duplicate TID elimination
> inside nbtree, across calls made from the executor (whether it's
> during backwards skip scans, or at any other time). Not because it
> seems to go against the approach taken by the MDAM paper (though it
> does); just because it seems kludgy. (I think that Tom felt the same
> way about the TID deduplication stuff in his own patch back in 2018,
> too.)
It's good to mention that the patch I attached does proper 'leaf-page-at-a-time' processing, so it avoids the problem you describe with v39. It is instead implemented in the same way as a "regular" index scan - we process the full leaf page and store the matched tuples in local state. If a DISTINCT scan wants to do a skip, we first check our local state to see whether the skip can be satisfied by the tuples already matched from the current page. That avoids double work, and also avoids the need to look at the same page again.
> Open question: What does all of this MDAM business mean for
> ScalarArrayOpExpr, if anything?
This is a really interesting combination actually. I think, ideally, you'd probably get rid of it and provide full support for that with the 'skip' based approach (essentially the ScalarArrayOpExpr seems to do some form of skipping already - it transforms x IN (1,2,3) into 3 separate index scans for x).
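To spell out the equivalence with a hand-written example (the table t and its index are hypothetical, just for illustration):

-- for a table t with an index on (x, y), this query
SELECT * FROM t WHERE x IN (1, 2, 3) AND y = 42;
-- is already executed as three consecutive index scans, one per array element --
-- roughly the same work as
SELECT * FROM t WHERE x = 1 AND y = 42
UNION ALL
SELECT * FROM t WHERE x = 2 AND y = 42
UNION ALL
SELECT * FROM t WHERE x = 3 AND y = 42;
-- which is exactly the "skip to the next value of x" pattern that a more
-- generic skip scan facility could also provide.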
However, even without doing any work on it, it actually interacts nicely with the skip based approach.
As an example, here are some queries based on the 'sales' table of the paper with some data in it (18M rows total, see the sales_query.sql attachment for the full example):
-- terminology from paper: "intervening range predicate"
select date, sum(total_sales)
from sales
where dept between 2 and 3 and date between '2021-01-05' and '2021-01-10' and item_class=20 and store=250
group by dept, date
;
Patch: Execution Time: 0.368 ms
Master: Execution Time: 416.792 ms
-- terminology from paper: "missing key predicate"
select date, sum(total_sales)
from sales
where date between '2021-01-05' and '2021-01-10' and item_class=20 and store=250
group by dept, date
;
Patch: Execution Time: 0.667 ms
Master: Execution Time: 654.684 ms
-- terminology from paper: "IN lists"
-- this is similar to the query from your example with an IN list
-- in the current patch, this query is done with a skip scan with skip prefix (dept, date) and then the ScalarOpArray for item_class=(20,30,50)
select date, sum(total_sales)
from sales
where date between '2021-01-05' and '2021-01-10' and item_class in (20, 30, 50) and store=250
group by dept, date
;
Patch: Execution Time: 1.767 ms
Master: Execution Time: 629.792 ms
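For reference, the rough shape of the table behind these numbers (only a sketch to give an idea of the setup - the exact definitions and the data generation are in the attached sales_query.sql):

CREATE TABLE sales (
    dept        int     NOT NULL,
    date        date    NOT NULL,
    item_class  int     NOT NULL,
    store       int     NOT NULL,
    total_sales numeric
);
-- composite index matching the paper's clustered key order
CREATE INDEX ON sales (dept, date, item_class, store);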
The other MDAM optimizations mentioned in the paper (NOT =, general OR) are not implemented, but they don't seem to conflict at all with the current implementation - they seem to be just a matter of transforming the expressions into the right form.
From the patch series above, v9-0001/v9-0002 are the UniqueKeys patches, which focus on the planner. v2-0001 is Dmitry's patch that extends them to make it possible to use UniqueKeys for the skip scan. Both of these are unfortunately still older patches, but because they are planner machinery they are less relevant to the discussion about the executor here.
Patch v2-0002 contains most of my work and introduces all the executor logic for the skip scan and hooks up the planner for DISTINCT queries to use the skip scan.
Patch v2-0003 is a planner hack that makes the planner pick a skip scan on virtually every possibility. This also enables the skip scan on the queries above that don't have a DISTINCT but where it could be useful.
-Floris
[1]: /messages/by-id/c5c5c974714a47f1b226c857699e8680@opammb0561.comp.optiver.com
Attachments:
v9-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
From 4b0e494612199483e61073fda1f32b7eea174b44 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Sun, 3 May 2020 22:37:46 +0800
Subject: [PATCH 1/5] Introduce RelOptInfo->notnullattrs attribute
The notnullattrs is calculated from catalog and run-time query. That
infomation is translated to child relation as well for partitioned
table.
---
src/backend/optimizer/path/allpaths.c | 31 ++++++++++++++++++++++++++
src/backend/optimizer/plan/initsplan.c | 10 +++++++++
src/backend/optimizer/util/plancat.c | 10 +++++++++
src/include/nodes/pathnodes.h | 2 ++
4 files changed, 53 insertions(+)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 296dd75c1b..acca3755a8 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -999,6 +999,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *childrel;
ListCell *parentvars;
ListCell *childvars;
+ int i = -1;
/* append_rel_list contains all append rels; ignore others */
if (appinfo->parent_relid != parentRTindex)
@@ -1055,6 +1056,36 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
(Node *) rel->reltarget->exprs,
1, &appinfo);
+ /* Copy notnullattrs. */
+ while ((i = bms_next_member(rel->notnullattrs, i)) > 0)
+ {
+ AttrNumber attno = i + FirstLowInvalidHeapAttributeNumber;
+ AttrNumber child_attno;
+ if (attno == 0)
+ {
+ /* Whole row is not null, so must be same for child */
+ childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ break;
+ }
+ if (attno < 0 )
+ /* no need to translate system column */
+ child_attno = attno;
+ else
+ {
+ Node * node = list_nth(appinfo->translated_vars, attno - 1);
+ if (!IsA(node, Var))
+ /* This may happens at UNION case, like (SELECT a FROM t1 UNION SELECT a + 3
+ * FROM t2) t and we know t.a is not null
+ */
+ continue;
+ child_attno = castNode(Var, node)->varattno;
+ }
+
+ childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+ child_attno - FirstLowInvalidHeapAttributeNumber);
+ }
+
/*
* We have to make child entries in the EquivalenceClass data
* structures as well. This is needed either if the parent
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index e25dc9a7ca..09c74e2a5a 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -831,6 +831,16 @@ deconstruct_recurse(PlannerInfo *root, Node *jtnode, bool below_outer_join,
{
Node *qual = (Node *) lfirst(l);
+ /* Set the not null info now */
+ ListCell *lc;
+ List *non_nullable_vars = find_nonnullable_vars(qual);
+ foreach(lc, non_nullable_vars)
+ {
+ Var *var = lfirst_node(Var, lc);
+ RelOptInfo *rel = root->simple_rel_array[var->varno];
+ rel->notnullattrs = bms_add_member(rel->notnullattrs,
+ var->varattno - FirstLowInvalidHeapAttributeNumber);
+ }
distribute_qual_to_rels(root, qual,
below_outer_join, JOIN_INNER,
root->qual_security_level,
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c5194fdbbf..a50f897ffa 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -117,6 +117,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
Relation relation;
bool hasindex;
List *indexinfos = NIL;
+ int i;
/*
* We need not lock the relation since it was already locked, either by
@@ -471,6 +472,15 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
set_relation_partition_info(root, rel, relation);
+ Assert(rel->notnullattrs == NULL);
+ for(i = 0; i < relation->rd_att->natts; i++)
+ {
+ FormData_pg_attribute attr = relation->rd_att->attrs[i];
+ if (attr.attnotnull)
+ rel->notnullattrs = bms_add_member(rel->notnullattrs,
+ attr.attnum - FirstLowInvalidHeapAttributeNumber);
+ }
+
table_close(relation, NoLock);
/*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 2a53a6e344..c624611784 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -727,6 +727,8 @@ typedef struct RelOptInfo
int rel_parallel_workers; /* wanted number of parallel workers */
uint32 amflags; /* Bitmask of optional features supported by
* the table AM */
+ /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */
+ Bitmapset *notnullattrs;
/* Information about foreign tables and foreign joins */
Oid serverid; /* identifies server for the table or join */
--
2.29.2
v9-0002-Introduce-UniqueKey-attributes-on-RelOptInfo-stru.patch
From 08aee9865f48cc8bb8c429af7f0058a8fc4991b5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Mon, 11 May 2020 15:50:52 +0800
Subject: [PATCH 2/5] Introduce UniqueKey attributes on RelOptInfo struct.
UniqueKey is a set of exprs on RelOptInfo which represents the exprs
will be unique on the given RelOptInfo. You can see README.uniquekey
for more information.
---
src/backend/nodes/copyfuncs.c | 13 +
src/backend/nodes/list.c | 31 +
src/backend/nodes/makefuncs.c | 13 +
src/backend/nodes/outfuncs.c | 11 +
src/backend/nodes/readfuncs.c | 10 +
src/backend/optimizer/path/Makefile | 3 +-
src/backend/optimizer/path/README.uniquekey | 131 +++
src/backend/optimizer/path/allpaths.c | 10 +
src/backend/optimizer/path/joinpath.c | 9 +-
src/backend/optimizer/path/joinrels.c | 2 +
src/backend/optimizer/path/pathkeys.c | 3 +-
src/backend/optimizer/path/uniquekeys.c | 1134 +++++++++++++++++++
src/backend/optimizer/plan/planner.c | 15 +-
src/backend/optimizer/prep/prepunion.c | 2 +
src/backend/optimizer/util/appendinfo.c | 44 +
src/backend/optimizer/util/inherit.c | 18 +-
src/include/nodes/makefuncs.h | 3 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 29 +-
src/include/nodes/pg_list.h | 1 +
src/include/optimizer/appendinfo.h | 3 +
src/include/optimizer/optimizer.h | 2 +
src/include/optimizer/paths.h | 43 +
23 files changed, 1506 insertions(+), 25 deletions(-)
create mode 100644 src/backend/optimizer/path/README.uniquekey
create mode 100644 src/backend/optimizer/path/uniquekeys.c
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 70e9e54d3e..652ba7f8ee 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2320,6 +2320,16 @@ _copyPathKey(const PathKey *from)
return newnode;
}
+static UniqueKey *
+_copyUniqueKey(const UniqueKey *from)
+{
+ UniqueKey *newnode = makeNode(UniqueKey);
+
+ COPY_NODE_FIELD(exprs);
+ COPY_SCALAR_FIELD(multi_nullvals);
+
+ return newnode;
+}
/*
* _copyRestrictInfo
*/
@@ -5313,6 +5323,9 @@ copyObjectImpl(const void *from)
case T_PathKey:
retval = _copyPathKey(from);
break;
+ case T_UniqueKey:
+ retval = _copyUniqueKey(from);
+ break;
case T_RestrictInfo:
retval = _copyRestrictInfo(from);
break;
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 94fb236daf..fa840511ac 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -702,6 +702,37 @@ list_member_oid(const List *list, Oid datum)
return false;
}
+/*
+ * return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+ const ListCell *lc1, *lc2;
+
+ Assert(IsPointerList(members));
+ Assert(IsPointerList(target));
+ check_list_invariants(members);
+ check_list_invariants(target);
+
+ foreach(lc1, members)
+ {
+ bool found = false;
+ foreach(lc2, target)
+ {
+ if (equal(lfirst(lc1), lfirst(lc2)))
+ {
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ return false;
+ }
+ return true;
+}
+
/*
* Delete the n'th cell (counting from 0) in list.
*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 7d1a01d1ed..da49133b82 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -815,3 +815,16 @@ makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
v->va_cols = va_cols;
return v;
}
+
+
+/*
+ * makeUniqueKey
+ */
+UniqueKey*
+makeUniqueKey(List *exprs, bool multi_nullvals)
+{
+ UniqueKey * ukey = makeNode(UniqueKey);
+ ukey->exprs = exprs;
+ ukey->multi_nullvals = multi_nullvals;
+ return ukey;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 2e5ed77e18..5a25a50edc 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2511,6 +2511,14 @@ _outPathKey(StringInfo str, const PathKey *node)
WRITE_BOOL_FIELD(pk_nulls_first);
}
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+ WRITE_NODE_TYPE("UNIQUEKEY");
+ WRITE_NODE_FIELD(exprs);
+ WRITE_BOOL_FIELD(multi_nullvals);
+}
+
static void
_outPathTarget(StringInfo str, const PathTarget *node)
{
@@ -4293,6 +4301,9 @@ outNode(StringInfo str, const void *obj)
case T_PathKey:
_outPathKey(str, obj);
break;
+ case T_UniqueKey:
+ _outUniqueKey(str, obj);
+ break;
case T_PathTarget:
_outPathTarget(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index abf08b7a2f..54d97ac3d0 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -494,6 +494,14 @@ _readSetOperationStmt(void)
READ_DONE();
}
+static UniqueKey *
+_readUniqueKey(void)
+{
+ READ_LOCALS(UniqueKey);
+ READ_NODE_FIELD(exprs);
+ READ_BOOL_FIELD(multi_nullvals);
+ READ_DONE();
+}
/*
* Stuff from primnodes.h.
@@ -2745,6 +2753,8 @@ parseNodeString(void)
return_value = _readCommonTableExpr();
else if (MATCH("SETOPERATIONSTMT", 16))
return_value = _readSetOperationStmt();
+ else if (MATCH("UNIQUEKEY", 9))
+ return_value = _readUniqueKey();
else if (MATCH("ALIAS", 5))
return_value = _readAlias();
else if (MATCH("RANGEVAR", 8))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
joinpath.o \
joinrels.o \
pathkeys.o \
- tidpath.o
+ tidpath.o \
+ uniquekeys.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 0000000000..5eac761995
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,131 @@
+1. What is UniqueKey?
+We can think UniqueKey is a set of exprs for a RelOptInfo, which we are insure
+that doesn't yields same result among all the rows. The simplest UniqueKey
+format is primary key.
+
+However we define the UnqiueKey as below.
+
+typedef struct UniqueKey
+{
+ NodeTag type;
+ List *exprs;
+ bool multi_nullvals;
+} UniqueKey;
+
+exprs is a list of exprs which is unique on current RelOptInfo. exprs = NIL
+is a special case of UniqueKey, which means there is only one row in that
+relation.it has a stronger semantic than others. like SELECT uk FROM t; uk is
+normal unique key and may have different values. SELECT colx FROM t WHERE uk =
+const. colx is unique AND we have only 1 value. This field can used for
+innerrel_is_unique. this logic is handled specially in add_uniquekey_for_onerow
+function.
+
+multi_nullvals: true means multi null values may exist in these exprs, so the
+uniqueness is not guaranteed in this case. This field is necessary for
+remove_useless_join & reduce_unique_semijoins where we don't mind these
+duplicated NULL values. It is set to true for 2 cases. One is a unique key
+from a unique index but the related column is nullable. The other one is for
+outer join. see populate_joinrel_uniquekeys for detail.
+
+
+The UniqueKey can be used at the following cases at least:
+1. remove_useless_joins.
+2. reduce_semianti_joins
+3. remove distinct node if distinct clause is unique.
+4. remove aggnode if group by clause is unique.
+5. Index Skip Scan (WIP)
+6. Aggregation Push Down without 2 phase aggregation if the join can't
+ duplicated the aggregated rows. (work in progress feature)
+
+2. How is it maintained?
+
+We have a set of populate_xxx_unqiuekeys functions to maintain the uniquekey on
+various cases. xxx includes baserel, joinrel, partitionedrel, distinctrel,
+groupedrel, unionrel. and we also need to convert the uniquekey from subquery
+to outer relation, which is what convert_subquery_uniquekeys does.
+
+1. The first part is about baserel. We handled 3 cases. suppose we have Unique
+Index on (a, b).
+
+1. SELECT a, b FROM t. UniqueKey (a, b)
+2. SELECT a FROM t WHERE b = 1; UniqueKey (a)
+3. SELECT .. FROM t WHERE a = 1 AND b = 1; UniqueKey (NIL). onerow case, every
+ column is Unique.
+
+2. The next part is joinrel, this part is most error-prone, we simplified the rules
+like below:
+1. If the relation's UniqueKey can't be duplicated after join, then is will be
+ still valid for the join rel. The function we used here is
+ innerrel_keeps_unique. The basic idea is innerrel.any_col = outer.uk.
+
+2. If the UnqiueKey can't keep valid via the rule 1, the combination of the
+ UniqueKey from both sides are valid for sure. We can prove this as: if the
+ unique exprs from rel1 is duplicated by rel2, the duplicated rows must
+ contains different unique exprs from rel2.
+
+More considerations about onerow:
+1. If relation with one row and it can't be duplicated, it is still possible
+ contains mulit_nullvas after outer join.
+2. If the either UniqueKey can be duplicated after join, the can get one row
+ only when both side is one row AND there is no outer join.
+3. Whenever the onerow UniqueKey is not a valid any more, we need to convert one
+ row UniqueKey to normal unique key since we don't store exprs for one-row
+ relation. get_exprs_from_uniquekeys will be used here.
+
+
+More considerations about multi_nullvals after join:
+1. If the original UnqiueKey has multi_nullvals, the final UniqueKey will have
+ mulit_nullvals in any case.
+2. If a unique key doesn't allow mulit_nullvals, after some outer join, it
+ allows some outer join.
+
+
+3. When we comes to subquery, we need to convert_subquery_unqiuekeys just like
+convert_subquery_pathkeys. Only the UniqueKey insides subquery is referenced as
+a Var in outer relation will be reused. The relationship between the outerrel.Var
+and subquery.exprs is built with outerel->subroot->processed_tlist.
+
+
+4. As for the SRF functions, it will break the uniqueness of uniquekey, However it
+is handled in adjust_paths_for_srfs, which happens after the query_planner. so
+we will maintain the UniqueKey until there and reset it to NIL at that
+places. This can't help on distinct/group by elimination cases but probably help
+in some other cases, like reduce_unqiue_semijoins/remove_useless_joins and it is
+semantic correctly.
+
+
+5. As for inherit table, we first main the UnqiueKey on childrel as well. But for
+partitioned table we need to maintain 2 different kinds of
+UnqiueKey. 1). UniqueKey on the parent relation 2). UniqueKey on child
+relation for partition wise query.
+
+Example:
+CREATE TABLE p (a int not null, b int not null) partition by list (a);
+CREATE TABLE p0 partition of p for values in (1);
+CREATE TABLE p1 partition of p for values in (2);
+
+create unique index p0_b on p0(b);
+create unique index p1_b on p1(b);
+
+Now b is only unique on partition level, so the distinct can't be removed on
+the following cases. SELECT DISTINCT b FROM p;
+
+Another example is SELECT DISTINCT a, b FROM p WHERE a = 1; Since only one
+partition is chosen, the UniqueKey on child relation is same as the UniqueKey on
+parent relation.
+
+Another usage of UniqueKey on partition level is it be helpful for
+partition-wise join.
+
+As for the UniqueKey on parent table level, it comes with 2 different ways,
+1). the UniqueKey is also derived in UniqueKey index, but the index must be same
+in all the related children relations and the unique index must contains
+Partition Key in it. Example:
+
+CREATE UNIQUE INDEX p_ab ON p(a, b); -- where a is the partition key.
+
+-- Query
+SELECT a, b FROM p; the (a, b) is a UniqueKey of p.
+
+2). If the parent relation has only one childrel, the UniqueKey on childrel is
+ the UniqueKey on parent as well.
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index acca3755a8..4dfc5d29bc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -580,6 +580,12 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
*/
check_index_predicates(root, rel);
+ /*
+ * Now that we've marked which partial indexes are suitable, we can now
+ * build the relation's unique keys.
+ */
+ populate_baserel_uniquekeys(root, rel, rel->indexlist);
+
/* Mark rel with estimated output rows, width, etc */
set_baserel_size_estimates(root, rel);
}
@@ -1298,6 +1304,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
/* Add paths to the append relation. */
add_paths_to_append_rel(root, rel, live_childrels);
+ if (IS_PARTITIONED_REL(rel))
+ populate_partitionedrel_uniquekeys(root, rel, live_childrels);
}
@@ -2306,6 +2314,8 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
pathkeys, required_outer));
}
+ convert_subquery_uniquekeys(root, rel, sub_final_rel);
+
/* If outer rel allows parallelism, do same for partial paths. */
if (rel->consider_parallel && bms_is_empty(required_outer))
{
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index 6407ede12a..1f4ae2d69a 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -77,13 +77,6 @@ static void consider_parallel_mergejoin(PlannerInfo *root,
static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
RelOptInfo *outerrel, RelOptInfo *innerrel,
JoinType jointype, JoinPathExtraData *extra);
-static List *select_mergejoin_clauses(PlannerInfo *root,
- RelOptInfo *joinrel,
- RelOptInfo *outerrel,
- RelOptInfo *innerrel,
- List *restrictlist,
- JoinType jointype,
- bool *mergejoin_allowed);
static void generate_mergejoin_paths(PlannerInfo *root,
RelOptInfo *joinrel,
RelOptInfo *innerrel,
@@ -2173,7 +2166,7 @@ hash_inner_and_outer(PlannerInfo *root,
* if it is mergejoinable and involves vars from the two sub-relations
* currently of interest.
*/
-static List *
+List *
select_mergejoin_clauses(PlannerInfo *root,
RelOptInfo *joinrel,
RelOptInfo *outerrel,
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 16cc9269ef..1a444a1e7d 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -924,6 +924,8 @@ populate_joinrel_with_paths(PlannerInfo *root, RelOptInfo *rel1,
/* Apply partitionwise join technique, if possible. */
try_partitionwise_join(root, rel1, rel2, joinrel, sjinfo, restrictlist);
+
+ populate_joinrel_uniquekeys(root, joinrel, rel1, rel2, restrictlist, sjinfo->jointype);
}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 216dd26385..8b267de06f 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -33,7 +33,6 @@ static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
int partkeycol);
-static Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
@@ -1035,7 +1034,7 @@ convert_subquery_pathkeys(PlannerInfo *root, RelOptInfo *rel,
* We need this to ensure that we don't return pathkeys describing values
* that are unavailable above the level of the subquery scan.
*/
-static Var *
+Var *
find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle)
{
ListCell *lc;
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
new file mode 100644
index 0000000000..ca40c40858
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -0,0 +1,1134 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekeys.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekeys.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/appendinfo.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/tlist.h"
+#include "rewrite/rewriteManip.h"
+
+
+/*
+ * This struct is used to help populate_joinrel_uniquekeys.
+ *
+ * added_to_joinrel is true if a uniquekey (from outerrel or innerrel)
+ * has been added to joinrel.
+ * useful is true if the exprs of the uniquekey still appears in joinrel.
+ */
+typedef struct UniqueKeyContextData
+{
+ UniqueKey *uniquekey;
+ bool added_to_joinrel;
+ bool useful;
+} *UniqueKeyContext;
+
+static List *initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel);
+static bool innerrel_keeps_unique(PlannerInfo *root,
+ RelOptInfo *outerrel,
+ RelOptInfo *innerrel,
+ List *restrictlist,
+ bool reverse);
+
+static List *get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+ List *const_exprs,
+ List *const_expr_opfamilies,
+ Bitmapset *used_varattrs,
+ bool *useful,
+ bool *multi_nullvals);
+static List *get_exprs_from_uniquekey(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *rel1,
+ UniqueKey *ukey);
+static void add_uniquekey_for_onerow(RelOptInfo *rel);
+static bool add_combined_uniquekey(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ UniqueKey *outer_ukey,
+ UniqueKey *inner_ukey,
+ JoinType jointype);
+
+/* Used for unique indexes checking for partitioned table */
+static bool index_constains_partkey(RelOptInfo *partrel, IndexOptInfo *ind);
+static IndexOptInfo *simple_copy_indexinfo_to_parent(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ IndexOptInfo *from);
+static bool simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2);
+static void adjust_partition_unique_indexlist(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ RelOptInfo *childrel,
+ List **global_unique_index);
+
+/* Helper function for grouped relation and distinct relation. */
+static void add_uniquekey_from_sortgroups(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *sortgroups);
+
+/*
+ * populate_baserel_uniquekeys
+ * Populate 'baserel' uniquekeys list by looking at the rel's unique index
+ * and baserestrictinfo
+ */
+void
+populate_baserel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *baserel,
+ List *indexlist)
+{
+ ListCell *lc;
+ List *matched_uniq_indexes = NIL;
+
+ /* Attrs appears in rel->reltarget->exprs. */
+ Bitmapset *used_attrs = NULL;
+
+ List *const_exprs = NIL;
+ List *expr_opfamilies = NIL;
+
+ Assert(baserel->rtekind == RTE_RELATION);
+
+ foreach(lc, indexlist)
+ {
+ IndexOptInfo *ind = (IndexOptInfo *) lfirst(lc);
+ if (!ind->unique || !ind->immediate ||
+ (ind->indpred != NIL && !ind->predOK))
+ continue;
+ matched_uniq_indexes = lappend(matched_uniq_indexes, ind);
+ }
+
+ if (matched_uniq_indexes == NIL)
+ return;
+
+ /* Check which attrs is used in baserel->reltarget */
+ pull_varattnos((Node *)baserel->reltarget->exprs, baserel->relid, &used_attrs);
+
+ /* Check which attrno is used at a mergeable const filter */
+ foreach(lc, baserel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+ if (rinfo->mergeopfamilies == NIL)
+ continue;
+
+ if (bms_is_empty(rinfo->left_relids))
+ {
+ const_exprs = lappend(const_exprs, get_rightop(rinfo->clause));
+ }
+ else if (bms_is_empty(rinfo->right_relids))
+ {
+ const_exprs = lappend(const_exprs, get_leftop(rinfo->clause));
+ }
+ else
+ continue;
+
+ expr_opfamilies = lappend(expr_opfamilies, rinfo->mergeopfamilies);
+ }
+
+ foreach(lc, matched_uniq_indexes)
+ {
+ bool multi_nullvals, useful;
+ List *exprs = get_exprs_from_uniqueindex(lfirst_node(IndexOptInfo, lc),
+ const_exprs,
+ expr_opfamilies,
+ used_attrs,
+ &useful,
+ &multi_nullvals);
+ if (useful)
+ {
+ if (exprs == NIL)
+ {
+ /* All the columns in Unique Index matched with a restrictinfo */
+ add_uniquekey_for_onerow(baserel);
+ return;
+ }
+ baserel->uniquekeys = lappend(baserel->uniquekeys,
+ makeUniqueKey(exprs, multi_nullvals));
+ }
+ }
+}
+
+
+/*
+ * populate_partitionedrel_uniquekeys
+ * The UniqueKey on partitionrel comes from 2 cases:
+ * 1). Only one partition is involved in this query, the unique key can be
+ * copied to parent rel from childrel.
+ * 2). There are some unique index which includes partition key and exists
+ * in all the related partitions.
+ * We never mind rule 2 if we hit rule 1.
+ */
+
+void
+populate_partitionedrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *childrels)
+{
+ ListCell *lc;
+ List *global_uniq_indexlist = NIL;
+ RelOptInfo *childrel;
+ bool is_first = true;
+
+ Assert(IS_PARTITIONED_REL(rel));
+
+ if (childrels == NIL)
+ return;
+
+ /*
+ * If there is only one partition used in this query, the UniqueKey in childrel is
+ * still valid in parent level, but we need convert the format from child expr to
+ * parent expr.
+ */
+ if (list_length(childrels) == 1)
+ {
+ /* Check for Rule 1 */
+ RelOptInfo *childrel = linitial_node(RelOptInfo, childrels);
+ ListCell *lc;
+ Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
+ if (relation_is_onerow(childrel))
+ {
+ add_uniquekey_for_onerow(rel);
+ return;
+ }
+
+ foreach(lc, childrel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ AppendRelInfo *appinfo = find_appinfo_by_child(root, childrel->relid);
+ List *parent_exprs = NIL;
+ bool can_reuse = true;
+ ListCell *lc2;
+ foreach(lc2, ukey->exprs)
+ {
+ Var *var = (Var *)lfirst(lc2);
+ /*
+ * If the expr comes from a expression, it is hard to build the expression
+ * in parent so ignore that case for now.
+ */
+ if(!IsA(var, Var))
+ {
+ can_reuse = false;
+ break;
+ }
+ /* Convert it to parent var */
+ parent_exprs = lappend(parent_exprs, find_parent_var(appinfo, var));
+ }
+ if (can_reuse)
+ rel->uniquekeys = lappend(rel->uniquekeys,
+ makeUniqueKey(parent_exprs,
+ ukey->multi_nullvals));
+ }
+ }
+ else
+ {
+ /* Check for rule 2 */
+ childrel = linitial_node(RelOptInfo, childrels);
+ foreach(lc, childrel->indexlist)
+ {
+ IndexOptInfo *ind = lfirst(lc);
+ IndexOptInfo *modified_index;
+ if (!ind->unique || !ind->immediate ||
+ (ind->indpred != NIL && !ind->predOK))
+ continue;
+
+ /*
+ * During simple_copy_indexinfo_to_parent, we need to convert var from
+ * child var to parent var, index on expression is too complex to handle.
+ * so ignore it for now.
+ */
+ if (ind->indexprs != NIL)
+ continue;
+
+ modified_index = simple_copy_indexinfo_to_parent(root, rel, ind);
+ /*
+ * If the unique index doesn't contain partkey, then it is unique
+ * on this partition only, so it is useless for us.
+ */
+ if (!index_constains_partkey(rel, modified_index))
+ continue;
+
+ global_uniq_indexlist = lappend(global_uniq_indexlist, modified_index);
+ }
+
+ if (global_uniq_indexlist != NIL)
+ {
+ foreach(lc, childrels)
+ {
+ RelOptInfo *child = lfirst(lc);
+ if (is_first)
+ {
+ is_first = false;
+ continue;
+ }
+ adjust_partition_unique_indexlist(root, rel, child, &global_uniq_indexlist);
+ }
+ /* Now we have a list of unique index which are exactly same on all childrels,
+ * Set the UniqueKey just like it is non-partition table
+ */
+ populate_baserel_uniquekeys(root, rel, global_uniq_indexlist);
+ }
+ }
+}
+
+
+/*
+ * populate_distinctrel_uniquekeys
+ */
+void
+populate_distinctrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *inputrel,
+ RelOptInfo *distinctrel)
+{
+ /* The unique key before the distinct is still valid. */
+ distinctrel->uniquekeys = list_copy(inputrel->uniquekeys);
+ add_uniquekey_from_sortgroups(root, distinctrel, root->parse->distinctClause);
+}
+
+/*
+ * populate_grouprel_uniquekeys
+ */
+void
+populate_grouprel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *grouprel,
+ RelOptInfo *inputrel)
+
+{
+ Query *parse = root->parse;
+ bool input_ukey_added = false;
+ ListCell *lc;
+
+ if (relation_is_onerow(inputrel))
+ {
+ add_uniquekey_for_onerow(grouprel);
+ return;
+ }
+ if (parse->groupingSets)
+ return;
+
+ /* A Normal group by without grouping set. */
+ if (parse->groupClause)
+ {
+ /*
+ * Current even the groupby clause is Unique already, but if query has aggref
+ * We have to create grouprel still. To keep the UnqiueKey short, we will check
+ * the UniqueKey of input_rel still valid, if so we reuse it.
+ */
+ foreach(lc, inputrel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ if (list_is_subset(ukey->exprs, grouprel->reltarget->exprs))
+ {
+ grouprel->uniquekeys = lappend(grouprel->uniquekeys,
+ ukey);
+ input_ukey_added = true;
+ }
+ }
+ if (!input_ukey_added)
+ /*
+ * group by clause must be a super-set of grouprel->reltarget->exprs except the
+ * aggregation expr, so if such exprs is unique already, no bother to generate
+ * new uniquekey for group by exprs.
+ */
+ add_uniquekey_from_sortgroups(root,
+ grouprel,
+ root->parse->groupClause);
+ }
+ else
+ /* It has aggregation but without a group by, so only one row returned */
+ add_uniquekey_for_onerow(grouprel);
+}
+
+/*
+ * simple_copy_uniquekeys
+ * Using a function for the one-line code makes us easy to check where we simply
+ * copied the uniquekey.
+ */
+void
+simple_copy_uniquekeys(RelOptInfo *oldrel,
+ RelOptInfo *newrel)
+{
+ newrel->uniquekeys = oldrel->uniquekeys;
+}
+
+/*
+ * populate_unionrel_uniquekeys
+ */
+void
+populate_unionrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *unionrel)
+{
+ ListCell *lc;
+ List *exprs = NIL;
+
+ Assert(unionrel->uniquekeys == NIL);
+
+ foreach(lc, unionrel->reltarget->exprs)
+ {
+ exprs = lappend(exprs, lfirst(lc));
+ }
+
+ if (exprs == NIL)
+ /* SQL: select union select; is valid, we need to handle it here. */
+ add_uniquekey_for_onerow(unionrel);
+ else
+ unionrel->uniquekeys = lappend(unionrel->uniquekeys,
+ makeUniqueKey(exprs,false));
+
+}
+
+/*
+ * populate_joinrel_uniquekeys
+ *
+ * populate uniquekeys for joinrel. We will check each relation to see if its
+ * UniqueKey is still valid via innerrel_keeps_unique, if so, we add it to
+ * joinrel. The multi_nullvals field will be changed to true for some outer
+ * join cases and one-row UniqueKey needs to be converted to normal UniqueKey
+ * for the same case as well.
+ * For the uniquekey in either baserel which can't be unique after join, we still
+ * check to see if combination of UniqueKeys from both side is still useful for us.
+ * if yes, we add it to joinrel as well.
+ */
+void
+populate_joinrel_uniquekeys(PlannerInfo *root, RelOptInfo *joinrel,
+ RelOptInfo *outerrel, RelOptInfo *innerrel,
+ List *restrictlist, JoinType jointype)
+{
+ ListCell *lc, *lc2;
+ List *clause_list = NIL;
+ List *outerrel_ukey_ctx;
+ List *innerrel_ukey_ctx;
+ bool inner_onerow, outer_onerow;
+ bool mergejoin_allowed;
+
+ /* Care about the outerrel relation only for SEMI/ANTI join */
+ if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+ {
+ foreach(lc, outerrel->uniquekeys)
+ {
+ UniqueKey *uniquekey = lfirst_node(UniqueKey, lc);
+ if (list_is_subset(uniquekey->exprs, joinrel->reltarget->exprs))
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys, uniquekey);
+ }
+ return;
+ }
+
+ Assert(jointype == JOIN_LEFT || jointype == JOIN_FULL || jointype == JOIN_INNER);
+
+ /* Fast path */
+ if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+ return;
+
+ inner_onerow = relation_is_onerow(innerrel);
+ outer_onerow = relation_is_onerow(outerrel);
+
+ outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(outerrel);
+ innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(innerrel);
+
+ clause_list = select_mergejoin_clauses(root, joinrel, outerrel, innerrel,
+ restrictlist, jointype,
+ &mergejoin_allowed);
+
+ if (innerrel_keeps_unique(root, innerrel, outerrel, clause_list, true /* reverse */))
+ {
+ bool outer_impact = jointype == JOIN_FULL;
+ foreach(lc, outerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+ if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+ {
+ ctx->useful = false;
+ continue;
+ }
+
+ /* Outer relation has one row, and the unique key is not duplicated after join,
+ * the joinrel will still has one row unless the jointype == JOIN_FULL.
+ */
+ if (outer_onerow && !outer_impact)
+ {
+ add_uniquekey_for_onerow(joinrel);
+ return;
+ }
+ else if (outer_onerow)
+ {
+ /*
+ * The onerow outerrel becomes multi rows and multi_nullvals
+ * will be changed to true. We also need to set the exprs correctly since it
+ * can't be NIL any more.
+ */
+ ListCell *lc2;
+ foreach(lc2, get_exprs_from_uniquekey(root, joinrel, outerrel, NULL))
+ {
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(lfirst(lc2), true));
+ }
+ }
+ else
+ {
+ if (!ctx->uniquekey->multi_nullvals && outer_impact)
+ /* Change multi_nullvals to true due to the full join. */
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(ctx->uniquekey->exprs, true));
+ else
+ /* Just reuse it */
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ ctx->uniquekey);
+ }
+ ctx->added_to_joinrel = true;
+ }
+ }
+
+ if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+ {
+ bool outer_impact = jointype == JOIN_FULL || jointype == JOIN_LEFT;;
+
+ foreach(lc, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+ if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+ {
+ ctx->useful = false;
+ continue;
+ }
+
+ if (inner_onerow && !outer_impact)
+ {
+ add_uniquekey_for_onerow(joinrel);
+ return;
+ }
+ else if (inner_onerow)
+ {
+ ListCell *lc2;
+ foreach(lc2, get_exprs_from_uniquekey(root, joinrel, innerrel, NULL))
+ {
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(lfirst(lc2), true));
+ }
+ }
+ else
+ {
+ if (!ctx->uniquekey->multi_nullvals && outer_impact)
+ /* Need to change multi_nullvals to true due to the outer join. */
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(ctx->uniquekey->exprs,
+ true));
+ else
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ ctx->uniquekey);
+
+ }
+ ctx->added_to_joinrel = true;
+ }
+ }
+
+ /*
+ * The combination of the UniqueKey from both sides is unique as well regardless
+ * of join type, but no bother to add it if its subset has been added to joinrel
+ * already or it is not useful for the joinrel.
+ */
+ foreach(lc, outerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx1 = (UniqueKeyContext) lfirst(lc);
+ if (ctx1->added_to_joinrel || !ctx1->useful)
+ continue;
+ foreach(lc2, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx2 = (UniqueKeyContext) lfirst(lc2);
+ if (ctx2->added_to_joinrel || !ctx2->useful)
+ continue;
+ if (add_combined_uniquekey(root, joinrel, outerrel, innerrel,
+ ctx1->uniquekey, ctx2->uniquekey,
+ jointype))
+ /* If we set a onerow UniqueKey to joinrel, we don't need other. */
+ return;
+ }
+ }
+}
+
+
+/*
+ * convert_subquery_uniquekeys
+ *
+ * Covert the UniqueKey in subquery to outer relation.
+ */
+void convert_subquery_uniquekeys(PlannerInfo *root,
+ RelOptInfo *currel,
+ RelOptInfo *sub_final_rel)
+{
+ ListCell *lc;
+
+ if (sub_final_rel->uniquekeys == NIL)
+ return;
+
+ if (relation_is_onerow(sub_final_rel))
+ {
+ add_uniquekey_for_onerow(currel);
+ return;
+ }
+
+ Assert(currel->subroot != NULL);
+
+ foreach(lc, sub_final_rel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ ListCell *lc;
+ List *exprs = NIL;
+ bool ukey_useful = true;
+
+ /* One row case is handled above */
+ Assert(ukey->exprs != NIL);
+ foreach(lc, ukey->exprs)
+ {
+ Var *var;
+ TargetEntry *tle = tlist_member(lfirst(lc),
+ currel->subroot->processed_tlist);
+ if (tle == NULL)
+ {
+ ukey_useful = false;
+ break;
+ }
+ var = find_var_for_subquery_tle(currel, tle);
+ if (var == NULL)
+ {
+ ukey_useful = false;
+ break;
+ }
+ exprs = lappend(exprs, var);
+ }
+
+ if (ukey_useful)
+ currel->uniquekeys = lappend(currel->uniquekeys,
+ makeUniqueKey(exprs,
+ ukey->multi_nullvals));
+
+ }
+}
+
+/*
+ * innerrel_keeps_unique
+ *
+ * Check if Unique key of the innerrel is valid after join. innerrel's UniqueKey
+ * will be still valid if innerrel's any-column mergeop outrerel's uniquekey
+ * exists in clause_list.
+ *
+ * Note: the clause_list must be a list of mergeable restrictinfo already.
+ */
+static bool
+innerrel_keeps_unique(PlannerInfo *root,
+ RelOptInfo *outerrel,
+ RelOptInfo *innerrel,
+ List *clause_list,
+ bool reverse)
+{
+ ListCell *lc, *lc2, *lc3;
+
+ if (outerrel->uniquekeys == NIL || innerrel->uniquekeys == NIL)
+ return false;
+
+ /* Check if there is outerrel's uniquekey in mergeable clause. */
+ foreach(lc, outerrel->uniquekeys)
+ {
+ List *outer_uq_exprs = lfirst_node(UniqueKey, lc)->exprs;
+ bool clauselist_matchs_all_exprs = true;
+ foreach(lc2, outer_uq_exprs)
+ {
+ Node *outer_uq_expr = lfirst(lc2);
+ bool find_uq_expr_in_clauselist = false;
+ foreach(lc3, clause_list)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc3);
+ Node *outer_expr;
+ if (reverse)
+ outer_expr = rinfo->outer_is_left ? get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+ else
+ outer_expr = rinfo->outer_is_left ? get_leftop(rinfo->clause) : get_rightop(rinfo->clause);
+ if (equal(outer_expr, outer_uq_expr))
+ {
+ find_uq_expr_in_clauselist = true;
+ break;
+ }
+ }
+ if (!find_uq_expr_in_clauselist)
+ {
+ /* No need to check the next exprs in the current uniquekey */
+ clauselist_matchs_all_exprs = false;
+ break;
+ }
+ }
+
+ if (clauselist_matchs_all_exprs)
+ return true;
+ }
+ return false;
+}
+
+
+/*
+ * relation_is_onerow
+ * Check if it is a one-row relation by checking UniqueKey.
+ */
+bool
+relation_is_onerow(RelOptInfo *rel)
+{
+ UniqueKey *ukey;
+ if (rel->uniquekeys == NIL)
+ return false;
+ ukey = linitial_node(UniqueKey, rel->uniquekeys);
+ return ukey->exprs == NIL && list_length(rel->uniquekeys) == 1;
+}
+
+/*
+ * relation_has_uniquekeys_for
+ * Returns true if we have proofs that 'rel' cannot return multiple rows with
+ * the same values in each of 'exprs'. Otherwise returns false.
+ */
+bool
+relation_has_uniquekeys_for(PlannerInfo *root, RelOptInfo *rel,
+ List *exprs, bool allow_multinulls)
+{
+ ListCell *lc;
+
+ /*
+ * For UniqueKey->onerow case, the uniquekey->exprs is empty as well
+ * so we can't rely on list_is_subset to handle this special cases
+ */
+ if (exprs == NIL)
+ return false;
+
+ foreach(lc, rel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ if (ukey->multi_nullvals && !allow_multinulls)
+ continue;
+ if (list_is_subset(ukey->exprs, exprs))
+ return true;
+ }
+ return false;
+}
+
+
+/*
+ * get_exprs_from_uniqueindex
+ *
+ * Return a list of exprs which is unique. set useful to false if this
+ * unique index is not useful for us.
+ */
+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+ List *const_exprs,
+ List *const_expr_opfamilies,
+ Bitmapset *used_varattrs,
+ bool *useful,
+ bool *multi_nullvals)
+{
+ List *exprs = NIL;
+ ListCell *indexpr_item;
+ int c = 0;
+
+ *useful = true;
+ *multi_nullvals = false;
+
+ indexpr_item = list_head(unique_index->indexprs);
+ for(c = 0; c < unique_index->ncolumns; c++)
+ {
+ int attr = unique_index->indexkeys[c];
+ Expr *expr;
+ bool matched_const = false;
+ ListCell *lc1, *lc2;
+
+ if(attr > 0)
+ {
+ expr = list_nth_node(TargetEntry, unique_index->indextlist, c)->expr;
+ }
+ else if (attr == 0)
+ {
+ /* Expression index */
+ expr = lfirst(indexpr_item);
+ indexpr_item = lnext(unique_index->indexprs, indexpr_item);
+ }
+ else /* attr < 0 */
+ {
+ /* Index on system column is not supported */
+ Assert(false);
+ }
+
+ /*
+ * Check index_col = Const case with regarding to opfamily checking
+ * If we can remove the index_col from the final UniqueKey->exprs.
+ */
+ forboth(lc1, const_exprs, lc2, const_expr_opfamilies)
+ {
+ if (list_member_oid((List *)lfirst(lc2), unique_index->opfamily[c])
+ && match_index_to_operand((Node *) lfirst(lc1), c, unique_index))
+ {
+ matched_const = true;
+ break;
+ }
+ }
+
+ if (matched_const)
+ continue;
+
+ /* Check if the indexed expr is used in rel */
+ if (attr > 0)
+ {
+ /*
+ * Normal Indexed column, if the col is not used, then the index is useless
+ * for uniquekey.
+ */
+ attr -= FirstLowInvalidHeapAttributeNumber;
+
+ if (!bms_is_member(attr, used_varattrs))
+ {
+ *useful = false;
+ break;
+ }
+ }
+ else if (!list_member(unique_index->rel->reltarget->exprs, expr))
+ {
+ /* Expression index but the expression is not used in rel */
+ *useful = false;
+ break;
+ }
+
+ /* check not null property. */
+ if (attr == 0)
+ {
+ /* We never know whether an expression yields NULL or not */
+ *multi_nullvals = true;
+ }
+ else if (!bms_is_member(attr, unique_index->rel->notnullattrs)
+ && !bms_is_member(0 - FirstLowInvalidHeapAttributeNumber,
+ unique_index->rel->notnullattrs))
+ {
+ *multi_nullvals = true;
+ }
+
+ exprs = lappend(exprs, expr);
+ }
+ return exprs;
+}
+
+
+/*
+ * add_uniquekey_for_onerow
+ * If we are sure that the relation only returns one row, then all the columns
+ * are unique. However, we don't need to create a UniqueKey for every column; we
+ * just set exprs = NIL and overwrite all the other UniqueKeys on this RelOptInfo,
+ * since this one has the strongest semantics.
+ */
+void
+add_uniquekey_for_onerow(RelOptInfo *rel)
+{
+ /*
+ * We overwrite the previous UniqueKey on purpose since this one has the
+ * strongest semantics.
+ */
+ rel->uniquekeys = list_make1(makeUniqueKey(NIL, false));
+}
+
+
+/*
+ * initililze_uniquecontext_for_joinrel
+ * Return a List of UniqueKeyContext for an inputrel
+ */
+static List *
+initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel)
+{
+ List *res = NIL;
+ ListCell *lc;
+ foreach(lc, inputrel->uniquekeys)
+ {
+ UniqueKeyContext context;
+ context = palloc(sizeof(struct UniqueKeyContextData));
+ context->uniquekey = lfirst_node(UniqueKey, lc);
+ context->added_to_joinrel = false;
+ context->useful = true;
+ res = lappend(res, context);
+ }
+ return res;
+}
+
+
+/*
+ * get_exprs_from_uniquekey
+ * Unify the way of getting a List of exprs from a one-row UniqueKey or a
+ * normal UniqueKey. For the onerow case, every expr in rel1 is a valid
+ * UniqueKey. Return a List of exprs.
+ *
+ * rel1: The relation whose exprs you want to get.
+ * ukey: The UniqueKey whose exprs you want to get.
+ */
+static List *
+get_exprs_from_uniquekey(PlannerInfo *root, RelOptInfo *joinrel, RelOptInfo *rel1, UniqueKey *ukey)
+{
+ ListCell *lc;
+ bool onerow = rel1 != NULL && relation_is_onerow(rel1);
+
+ List *res = NIL;
+ Assert(onerow || ukey);
+ if (onerow)
+ {
+ /* Only care about the exprs that still exist in the joinrel */
+ foreach(lc, joinrel->reltarget->exprs)
+ {
+ Bitmapset *relids = pull_varnos(root, lfirst(lc));
+ if (bms_is_subset(relids, rel1->relids))
+ {
+ res = lappend(res, list_make1(lfirst(lc)));
+ }
+ }
+ }
+ else
+ {
+ res = list_make1(ukey->exprs);
+ }
+ return res;
+}
+
+/*
+ * Partitioned table Unique Keys.
+ * The partitioned table unique key is maintained as follows:
+ * 1. The index must be unique as usual.
+ * 2. The index must contain the partition key.
+ * 3. The index must exist on all the child rels. See simple_indexinfo_equal for
+ * how we compare it.
+ */
+
+/*
+ * index_constains_partkey
+ * Return true if the index contains the partition key.
+ */
+static bool
+index_constains_partkey(RelOptInfo *partrel, IndexOptInfo *ind)
+{
+ ListCell *lc;
+ int i;
+ Assert(IS_PARTITIONED_REL(partrel));
+ Assert(partrel->part_scheme->partnatts > 0);
+
+ for(i = 0; i < partrel->part_scheme->partnatts; i++)
+ {
+ Node *part_expr = linitial(partrel->partexprs[i]);
+ bool found_in_index = false;
+ foreach(lc, ind->indextlist)
+ {
+ Expr *index_expr = lfirst_node(TargetEntry, lc)->expr;
+ if (equal(index_expr, part_expr))
+ {
+ found_in_index = true;
+ break;
+ }
+ }
+ if (!found_in_index)
+ return false;
+ }
+ return true;
+}
+
+/*
+ * simple_indexinfo_equal
+ *
+ * Used to check whether the two indexes are the same. The index here
+ * is COPIED from the childrel with some tiny changes (see
+ * simple_copy_indexinfo_to_parent).
+ */
+static bool
+simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2)
+{
+ Size oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+ return ind1->ncolumns == ind2->ncolumns &&
+ ind1->unique == ind2->unique &&
+ memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+ memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+ memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+ memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+ equal(get_tlist_exprs(ind1->indextlist, true),
+ get_tlist_exprs(ind2->indextlist, true));
+}
+
+
+/*
+ * The below macros are used for simple_copy_indexinfo_to_parent, which is so
+ * customized that I don't want to put it in copyfuncs.c. So copy them here.
+ */
+#define COPY_POINTER_FIELD(fldname, sz) \
+ do { \
+ Size _size = (sz); \
+ newnode->fldname = palloc(_size); \
+ memcpy(newnode->fldname, from->fldname, _size); \
+ } while (0)
+
+#define COPY_NODE_FIELD(fldname) \
+ (newnode->fldname = copyObjectImpl(from->fldname))
+
+#define COPY_SCALAR_FIELD(fldname) \
+ (newnode->fldname = from->fldname)
+
+
+/*
+ * simple_copy_indexinfo_to_parent (from partition)
+ * Copy the IndexOptInfo from the child relation to the parent relation with
+ * some modification, which is used to test:
+ * 1. Whether the same index exists in all the childrels.
+ * 2. Whether the parentrel->reltarget/baserestrictinfo matches this index.
+ */
+static IndexOptInfo *
+simple_copy_indexinfo_to_parent(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ IndexOptInfo *from)
+{
+ IndexOptInfo *newnode = makeNode(IndexOptInfo);
+ AppendRelInfo *appinfo = find_appinfo_by_child(root, from->rel->relid);
+ ListCell *lc;
+ int idx = 0;
+
+ COPY_SCALAR_FIELD(ncolumns);
+ COPY_SCALAR_FIELD(nkeycolumns);
+ COPY_SCALAR_FIELD(unique);
+ COPY_SCALAR_FIELD(immediate);
+ /* We just need to know if it is NIL or not */
+ COPY_SCALAR_FIELD(indpred);
+ COPY_SCALAR_FIELD(predOK);
+ COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+ COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+ COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+ COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+ COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+ COPY_NODE_FIELD(indextlist);
+
+ /* Convert index exprs on child expr to expr on parent */
+ foreach(lc, newnode->indextlist)
+ {
+ TargetEntry *tle = lfirst_node(TargetEntry, lc);
+ /* Index on expression is ignored */
+ Assert(IsA(tle->expr, Var));
+ tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+ newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+ idx++;
+ }
+ newnode->rel = parentrel;
+ return newnode;
+}
+
+/*
+ * adjust_partition_unique_indexlist
+ *
+ * global_unique_indexes: At the beginning, it contains the copied & modified
+ * unique indexes from the first partition. Then check whether each index in it
+ * still exists in the following partitions. If not, remove it. At last, it holds
+ * an index list which exists in all the partitions.
+ */
+static void
+adjust_partition_unique_indexlist(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ RelOptInfo *childrel,
+ List **global_unique_indexes)
+{
+ ListCell *lc, *lc2;
+ foreach(lc, *global_unique_indexes)
+ {
+ IndexOptInfo *g_ind = lfirst_node(IndexOptInfo, lc);
+ bool found_in_child = false;
+
+ foreach(lc2, childrel->indexlist)
+ {
+ IndexOptInfo *p_ind = lfirst_node(IndexOptInfo, lc2);
+ IndexOptInfo *p_ind_copy;
+ if (!p_ind->unique || !p_ind->immediate ||
+ (p_ind->indpred != NIL && !p_ind->predOK))
+ continue;
+ p_ind_copy = simple_copy_indexinfo_to_parent(root, parentrel, p_ind);
+ if (simple_indexinfo_equal(p_ind_copy, g_ind))
+ {
+ found_in_child = true;
+ break;
+ }
+ }
+ if (!found_in_child)
+ /* The index doesn't exist in childrel, remove it from global_unique_indexes */
+ *global_unique_indexes = foreach_delete_current(*global_unique_indexes, lc);
+ }
+}
+
+/* Helper function for groupres/distinctrel */
+static void
+add_uniquekey_from_sortgroups(PlannerInfo *root, RelOptInfo *rel, List *sortgroups)
+{
+ Query *parse = root->parse;
+ List *exprs;
+
+ /*
+ * XXX: If there are some vars which are not at the current query level, the
+ * semantics are imprecise; should we avoid this or not? levelsup = 1 is just
+ * a demo, maybe we need to check every level other than 0; if so, it looks
+ * like we have to write another pull_var_walker.
+ */
+ List *upper_vars = pull_vars_of_level((Node*)sortgroups, 1);
+
+ if (upper_vars != NIL)
+ return;
+
+ exprs = get_sortgrouplist_exprs(sortgroups, parse->targetList);
+ rel->uniquekeys = lappend(rel->uniquekeys,
+ makeUniqueKey(exprs,
+ false /* sortgroupclause can't be multi_nullvals */));
+}
+
+
+/*
+ * add_combined_uniquekey
+ * The combination of both UniqueKeys is a valid UniqueKey for the joinrel no
+ * matter the jointype.
+ */
+bool
+add_combined_uniquekey(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ UniqueKey *outer_ukey,
+ UniqueKey *inner_ukey,
+ JoinType jointype)
+{
+
+ ListCell *lc1, *lc2;
+
+ /* If either side has multi_nullvals or we have an outer join,
+ * the combined UniqueKey has multi_nullvals */
+ bool multi_nullvals = outer_ukey->multi_nullvals ||
+ inner_ukey->multi_nullvals || IS_OUTER_JOIN(jointype);
+
+ /* The only case where we can get a onerow joinrel after the join */
+ if (relation_is_onerow(outer_rel)
+ && relation_is_onerow(inner_rel)
+ && jointype == JOIN_INNER)
+ {
+ add_uniquekey_for_onerow(joinrel);
+ return true;
+ }
+
+ foreach(lc1, get_exprs_from_uniquekey(root, joinrel, outer_rel, outer_ukey))
+ {
+ foreach(lc2, get_exprs_from_uniquekey(root, joinrel, inner_rel, inner_ukey))
+ {
+ List *exprs = list_concat_copy(lfirst_node(List, lc1), lfirst_node(List, lc2));
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(exprs,
+ multi_nullvals));
+ }
+ }
+ return false;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bd01ec0526..3cadd22e3e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -187,8 +187,7 @@ static void create_one_window_path(PlannerInfo *root,
PathTarget *output_target,
WindowFuncLists *wflists,
List *activeWindows);
-static RelOptInfo *create_distinct_paths(PlannerInfo *root,
- RelOptInfo *input_rel);
+static RelOptInfo *create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel);
static void create_partial_distinct_paths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *final_distinct_rel);
@@ -1866,6 +1865,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
add_path(final_rel, path);
}
+ simple_copy_uniquekeys(current_rel, final_rel);
+
/*
* Generate partial paths for final_rel, too, if outer query levels might
* be able to make use of them.
@@ -3380,6 +3381,8 @@ create_grouping_paths(PlannerInfo *root,
}
set_cheapest(grouped_rel);
+
+ populate_grouprel_uniquekeys(root, grouped_rel, input_rel);
return grouped_rel;
}
@@ -4102,7 +4105,7 @@ create_window_paths(PlannerInfo *root,
/* Now choose the best path(s) */
set_cheapest(window_rel);
-
+ simple_copy_uniquekeys(input_rel, window_rel);
return window_rel;
}
@@ -4291,6 +4294,7 @@ create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel)
/* Now choose the best path(s) */
set_cheapest(distinct_rel);
+ populate_distinctrel_uniquekeys(root, input_rel, distinct_rel);
return distinct_rel;
}
@@ -4823,6 +4827,8 @@ create_ordered_paths(PlannerInfo *root,
*/
Assert(ordered_rel->pathlist != NIL);
+ simple_copy_uniquekeys(input_rel, ordered_rel);
+
return ordered_rel;
}
@@ -5700,6 +5706,9 @@ adjust_paths_for_srfs(PlannerInfo *root, RelOptInfo *rel,
if (list_length(targets) == 1)
return;
+ /* UniqueKey is not valid after handling the SRF. */
+ rel->uniquekeys = NIL;
+
/*
* Stack SRF-evaluation nodes atop each path for the rel.
*
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index e9256a2d4d..f6e836d8e4 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -690,6 +690,8 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
/* Undo effects of possibly forcing tuple_fraction to 0 */
root->tuple_fraction = save_fraction;
+ /* Add the UniqueKeys */
+ populate_unionrel_uniquekeys(root, result_rel);
return result_rel;
}
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index af46f581ac..2bf4ffd618 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -1000,3 +1000,47 @@ distribute_row_identity_vars(PlannerInfo *root)
}
}
}
+
+/*
+ * find_appinfo_by_child
+ * Find the AppendRelInfo whose child_relid equals the given child_index.
+ */
+AppendRelInfo *
+find_appinfo_by_child(PlannerInfo *root, Index child_index)
+{
+ ListCell *lc;
+ foreach(lc, root->append_rel_list)
+ {
+ AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ if (appinfo->child_relid == child_index)
+ return appinfo;
+ }
+ elog(ERROR, "parent relation cant be found");
+ return NULL;
+}
+
+/*
+ * find_parent_var
+ * Translate a child Var to the corresponding parent Var using the AppendRelInfo.
+ */
+Var *
+find_parent_var(AppendRelInfo *appinfo, Var *child_var)
+{
+ ListCell *lc;
+ Var *res = NULL;
+ Index attno = 1;
+ foreach(lc, appinfo->translated_vars)
+ {
+ Node *child_node = lfirst(lc);
+ if (equal(child_node, child_var))
+ {
+ res = copyObject(child_var);
+ res->varattno = attno;
+ res->varno = appinfo->parent_relid;
+ }
+ attno++;
+ }
+ if (res == NULL)
+ elog(ERROR, "parent var can't be found.");
+ return res;
+}
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index c758172efa..0cf3f3e866 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -797,6 +797,7 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
{
Node *onecq = (Node *) lfirst(lc2);
bool pseudoconstant;
+ RestrictInfo *child_rinfo;
/* check for pseudoconstant (no Vars or volatile functions) */
pseudoconstant =
@@ -808,14 +809,15 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
root->hasPseudoConstantQuals = true;
}
/* reconstitute RestrictInfo with appropriate properties */
- childquals = lappend(childquals,
- make_restrictinfo(root,
- (Expr *) onecq,
- rinfo->is_pushed_down,
- rinfo->outerjoin_delayed,
- pseudoconstant,
- rinfo->security_level,
- NULL, NULL, NULL));
+ child_rinfo = make_restrictinfo(root,
+ (Expr *) onecq,
+ rinfo->is_pushed_down,
+ rinfo->outerjoin_delayed,
+ pseudoconstant,
+ rinfo->security_level,
+ NULL, NULL, NULL);
+ child_rinfo->mergeopfamilies = rinfo->mergeopfamilies;
+ childquals = lappend(childquals, child_rinfo);
/* track minimum security level among child quals */
cq_min_security = Min(cq_min_security, rinfo->security_level);
}
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index eea87f847d..d2eebd3271 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -16,6 +16,7 @@
#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
extern A_Expr *makeA_Expr(A_Expr_Kind kind, List *name,
@@ -106,4 +107,6 @@ extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int loc
extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
+extern UniqueKey* makeUniqueKey(List *exprs, bool multi_nullvals);
+
#endif /* MAKEFUNC_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index e0057daa06..590c907d75 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -267,6 +267,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c624611784..0c758d10c5 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -748,6 +748,7 @@ typedef struct RelOptInfo
QualCost baserestrictcost; /* cost of evaluating the above */
Index baserestrict_min_security; /* min security_level found in
* baserestrictinfo */
+ List *uniquekeys; /* List of UniqueKey */
List *joininfo; /* RestrictInfo structures for join clauses
* involving this rel */
bool has_eclass_joins; /* T means joininfo is incomplete */
@@ -1083,6 +1084,28 @@ typedef enum VolatileFunctionStatus
VOLATILITY_NOVOLATILE
} VolatileFunctionStatus;
+/*
+ * UniqueKey
+ *
+ * Represents the unique properties held by a RelOptInfo.
+ *
+ * exprs is a list of exprs which are unique on the current RelOptInfo. exprs = NIL
+ * is a special case of UniqueKey, which means there is only 1 row in that
+ * relation.
+ * multi_nullvals: true means multiple NULL values may exist in these exprs, so
+ * the uniqueness is not guaranteed in this case. This field is necessary for
+ * remove_useless_join & reduce_unique_semijoins, where we don't mind these
+ * duplicated NULL values. It is set to true in 2 cases. One is a unique key
+ * from a unique index where the related column is nullable. The other one is
+ * for outer joins. See populate_joinrel_uniquekeys for details.
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+ List *exprs;
+ bool multi_nullvals;
+} UniqueKey;
+
/*
* PathTarget
*
@@ -2578,7 +2601,7 @@ typedef enum
*
* flags indicating what kinds of grouping are possible.
* partial_costs_set is true if the agg_partial_costs and agg_final_costs
- * have been initialized.
+ * have been initialized.
* agg_partial_costs gives partial aggregation costs.
* agg_final_costs gives finalization costs.
* target_parallel_safe is true if target is parallel safe.
@@ -2608,8 +2631,8 @@ typedef struct
* limit_tuples is an estimated bound on the number of output tuples,
* or -1 if no LIMIT or couldn't estimate.
* count_est and offset_est are the estimated values of the LIMIT and OFFSET
- * expressions computed by preprocess_limit() (see comments for
- * preprocess_limit() for more information).
+ * expressions computed by preprocess_limit() (see comments for
+ * preprocess_limit() for more information).
*/
typedef struct
{
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 30f98c4595..cc466eceaf 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -558,6 +558,7 @@ extern bool list_member_ptr(const List *list, const void *datum);
extern bool list_member_int(const List *list, int datum);
extern bool list_member_oid(const List *list, Oid datum);
+extern bool list_is_subset(const List *members, const List *target);
extern pg_nodiscard List *list_delete(List *list, void *datum);
extern pg_nodiscard List *list_delete_ptr(List *list, void *datum);
extern pg_nodiscard List *list_delete_int(List *list, int datum);
diff --git a/src/include/optimizer/appendinfo.h b/src/include/optimizer/appendinfo.h
index 39d04d9cc0..2abc26e500 100644
--- a/src/include/optimizer/appendinfo.h
+++ b/src/include/optimizer/appendinfo.h
@@ -47,4 +47,7 @@ extern void add_row_identity_columns(PlannerInfo *root, Index rtindex,
Relation target_relation);
extern void distribute_row_identity_vars(PlannerInfo *root);
+extern AppendRelInfo *find_appinfo_by_child(PlannerInfo *root, Index child_index);
+extern Var *find_parent_var(AppendRelInfo *appinfo, Var *child_var);
+
#endif /* APPENDINFO_H */
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 41b49b2662..81c71119f6 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -23,6 +23,7 @@
#define OPTIMIZER_H
#include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
/* Test if an expression node represents a SRF call. Beware multiple eval! */
#define IS_SRF_CALL(node) \
@@ -171,6 +172,7 @@ extern TargetEntry *get_sortgroupref_tle(Index sortref,
List *targetList);
extern TargetEntry *get_sortgroupclause_tle(SortGroupClause *sgClause,
List *targetList);
+extern Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
extern Node *get_sortgroupclause_expr(SortGroupClause *sgClause,
List *targetList);
extern List *get_sortgrouplist_exprs(List *sgClauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index f1d111063c..754dfcd549 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -254,5 +254,48 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern List *select_mergejoin_clauses(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outerrel,
+ RelOptInfo *innerrel,
+ List *restrictlist,
+ JoinType jointype,
+ bool *mergejoin_allowed);
+
+/*
+ * uniquekeys.c
+ * Utilities for matching and building unique keys
+ */
+extern void populate_baserel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *baserel,
+ List* unique_index_list);
+extern void populate_partitionedrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *childrels);
+extern void populate_distinctrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *inputrel,
+ RelOptInfo *distinctrel);
+extern void populate_grouprel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *grouprel,
+ RelOptInfo *inputrel);
+extern void populate_unionrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *unionrel);
+extern void simple_copy_uniquekeys(RelOptInfo *oldrel,
+ RelOptInfo *newrel);
+extern void convert_subquery_uniquekeys(PlannerInfo *root,
+ RelOptInfo *currel,
+ RelOptInfo *sub_final_rel);
+extern void populate_joinrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *rel1,
+ RelOptInfo *rel2,
+ List *restrictlist,
+ JoinType jointype);
+
+extern bool relation_has_uniquekeys_for(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *exprs,
+ bool allow_multinulls);
+extern bool relation_is_onerow(RelOptInfo *rel);
#endif /* PATHS_H */
--
2.29.2
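To make the UniqueKey machinery above a bit more concrete, here is the kind
of query it is meant to recognize (just a sketch with made-up table and
column names, not output from the patch):

CREATE TABLE orders (order_id int PRIMARY KEY, customer_id int, total numeric);

-- order_id is backed by a unique index on a non-nullable column, so the
-- base rel gets a UniqueKey on (order_id). Any expr list containing
-- order_id is then provably duplicate-free, which makes the DISTINCT below
-- redundant from the planner's point of view.
SELECT DISTINCT order_id, total FROM orders;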
Attachment: v2-0001-Extend-UniqueKeys.patch (application/octet-stream)
From c5023315eaaf9e1b4e338ce488c92f2e85f56543 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 8 Jun 2020 20:33:56 +0200
Subject: [PATCH 3/5] Extend UniqueKeys
Prepares the index skip scan implementation using UniqueKeys. Allows
specifying which "requested" keys should be unique, and adds them to the
necessary Paths to make them useful later.
Proposed by David Rowley; contains a few bits of a previous version from
Jesper Pedersen.
---
src/backend/optimizer/path/pathkeys.c | 59 +++++++++++++++++++++++
src/backend/optimizer/path/uniquekeys.c | 63 +++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 36 +++++++++++++-
src/backend/optimizer/util/pathnode.c | 32 +++++++++----
src/include/nodes/pathnodes.h | 5 ++
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 8 ++++
7 files changed, 194 insertions(+), 10 deletions(-)
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 8b267de06f..ad4fe19872 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -95,6 +96,29 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Checks if the new pathkey's equivalence class is the same as that of
+ * any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already in the list, then not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return false;
+ }
+
+ return true;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -1151,6 +1175,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_uniquekeys
+ * Generate a pathkeys list to be used for uniquekey clauses
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, sortclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ if (pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
index ca40c40858..ab4b1d1939 100644
--- a/src/backend/optimizer/path/uniquekeys.c
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -1132,3 +1132,66 @@ add_combined_uniquekey(PlannerInfo *root,
}
return false;
}
+
+List*
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+ List *result = NIL;
+ List *sortkeys;
+ ListCell *l;
+ List *exprs = NIL;
+
+ sortkeys = make_pathkeys_for_uniquekeys(root,
+ sortclauses,
+ root->processed_tlist);
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, sortkeys)
+ {
+ PathKey *pathkey = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = pathkey->pk_eclass;
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
+ if (EC_MUST_BE_REDUNDANT(ec))
+ continue;
+ exprs = lappend(exprs, mem->em_expr);
+ }
+
+ result = lappend(result, makeUniqueKey(exprs, false));
+
+ return result;
+}
+
+bool
+query_has_uniquekeys_for(PlannerInfo *root, List *pathuniquekeys,
+ bool allow_multinulls)
+{
+ ListCell *lc;
+ ListCell *lc2;
+
+ /* root->query_uniquekeys are the requested DISTINCT clauses at the query level.
+ * pathuniquekeys are the unique keys of the current path.
+ * All requested query_uniquekeys must be satisfied by the pathuniquekeys.
+ */
+ foreach(lc, root->query_uniquekeys)
+ {
+ UniqueKey *query_ukey = lfirst_node(UniqueKey, lc);
+ bool satisfied = false;
+ foreach(lc2, pathuniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc2);
+ if (ukey->multi_nullvals && !allow_multinulls)
+ continue;
+ if (list_length(ukey->exprs) == 0 &&
+ list_length(query_ukey->exprs) != 0)
+ continue;
+ if (list_is_subset(ukey->exprs, query_ukey->exprs))
+ {
+ satisfied = true;
+ break;
+ }
+ }
+ if (!satisfied)
+ return false;
+ }
+ return true;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 3cadd22e3e..ea2408c13f 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3082,12 +3082,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
*/
if (qp_extra->groupClause &&
grouping_is_sortable(qp_extra->groupClause))
+ {
root->group_pathkeys =
make_pathkeys_for_sortclauses(root,
qp_extra->groupClause,
tlist);
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
+ {
root->group_pathkeys = NIL;
+ root->query_uniquekeys = NIL;
+ }
/* We consider only the first (bottom) window in pathkeys logic */
if (activeWindows != NIL)
@@ -4497,13 +4503,19 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
Path *path = (Path *) lfirst(lc);
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
- {
add_path(distinct_rel, (Path *)
create_upper_unique_path(root, distinct_rel,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
- }
+ }
+
+ foreach(lc, input_rel->unique_pathlist)
+ {
+ Path *path = (Path *) lfirst(lc);
+
+ if (query_has_uniquekeys_for(root, needed_pathkeys, false))
+ add_path(distinct_rel, path);
}
/* For explicit-sort case, always use the more rigorous clause */
@@ -7118,6 +7130,26 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
}
+ foreach(lc, rel->unique_pathlist)
+ {
+ Path *subpath = (Path *) lfirst(lc);
+
+ /* Shouldn't have any parameterized paths anymore */
+ Assert(subpath->param_info == NULL);
+
+ if (tlist_same_exprs)
+ subpath->pathtarget->sortgrouprefs =
+ scanjoin_target->sortgrouprefs;
+ else
+ {
+ Path *newpath;
+
+ newpath = (Path *) create_projection_path(root, rel, subpath,
+ scanjoin_target);
+ lfirst(lc) = newpath;
+ }
+ }
+
/*
* Now, if final scan/join target contains SRFs, insert ProjectSetPath(s)
* atop each existing path. (Note that this function doesn't look at the
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e53d381e19..74e100e5a9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -416,10 +416,10 @@ set_cheapest(RelOptInfo *parent_rel)
* 'parent_rel' is the relation entry to which the path corresponds.
* 'new_path' is a potential path for parent_rel.
*
- * Returns nothing, but modifies parent_rel->pathlist.
+ * Returns the modified pathlist.
*/
-void
-add_path(RelOptInfo *parent_rel, Path *new_path)
+static List *
+add_path_to(RelOptInfo *parent_rel, List *pathlist, Path *new_path)
{
bool accept_new = true; /* unless we find a superior old path */
int insert_at = 0; /* where to insert new item */
@@ -440,7 +440,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
* for more than one old path to be tossed out because new_path dominates
* it.
*/
- foreach(p1, parent_rel->pathlist)
+ foreach(p1, pathlist)
{
Path *old_path = (Path *) lfirst(p1);
bool remove_old = false; /* unless new proves superior */
@@ -584,8 +584,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
*/
if (remove_old)
{
- parent_rel->pathlist = foreach_delete_current(parent_rel->pathlist,
- p1);
+ pathlist = foreach_delete_current(pathlist, p1);
/*
* Delete the data pointed-to by the deleted cell, if possible
@@ -612,8 +611,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
if (accept_new)
{
/* Accept the new path: insert it at proper place in pathlist */
- parent_rel->pathlist =
- list_insert_nth(parent_rel->pathlist, insert_at, new_path);
+ pathlist = list_insert_nth(pathlist, insert_at, new_path);
}
else
{
@@ -621,6 +619,23 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
if (!IsA(new_path, IndexPath))
pfree(new_path);
}
+
+ return pathlist;
+}
+
+void
+add_path(RelOptInfo *parent_rel, Path *new_path)
+{
+ parent_rel->pathlist = add_path_to(parent_rel,
+ parent_rel->pathlist, new_path);
+}
+
+void
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)
+{
+ parent_rel->unique_pathlist = add_path_to(parent_rel,
+ parent_rel->unique_pathlist,
+ new_path);
}
/*
@@ -2661,6 +2676,7 @@ create_projection_path(PlannerInfo *root,
pathnode->path.pathkeys = subpath->pathkeys;
pathnode->subpath = subpath;
+ pathnode->path.uniquekeys = subpath->uniquekeys;
/*
* We might not need a separate Result node. If the input plan node type
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0c758d10c5..3ae6b91576 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -293,6 +293,7 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* unique keys required for the query */
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -695,6 +696,7 @@ typedef struct RelOptInfo
List *pathlist; /* Path structures */
List *ppilist; /* ParamPathInfos used in pathlist */
List *partial_pathlist; /* partial Paths */
+ List *unique_pathlist; /* unique Paths */
struct Path *cheapest_startup_path;
struct Path *cheapest_total_path;
struct Path *cheapest_unique_path;
@@ -886,6 +888,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
bool amcanmarkpos; /* does AM support mark/restore? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
@@ -1220,6 +1223,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *uniquekeys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f704d39980..facb2dfe74 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -27,6 +27,7 @@ extern int compare_fractional_path_costs(Path *path1, Path *path2,
double fraction);
extern void set_cheapest(RelOptInfo *parent_rel);
extern void add_path(RelOptInfo *parent_rel, Path *new_path);
+extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path);
extern bool add_path_precheck(RelOptInfo *parent_rel,
Cost startup_cost, Cost total_cost,
List *pathkeys, Relids required_outer);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 754dfcd549..e71e65264a 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -229,6 +229,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -296,6 +299,11 @@ extern bool relation_has_uniquekeys_for(PlannerInfo *root,
RelOptInfo *rel,
List *exprs,
bool allow_multinulls);
+extern bool query_has_uniquekeys_for(PlannerInfo *root,
+ List *exprs,
+ bool allow_multinulls);
extern bool relation_is_onerow(RelOptInfo *rel);
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+
#endif /* PATHS_H */
--
2.29.2
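As a rough illustration of what the "requested" unique keys in this patch
are for (again only a sketch with made-up names, not output from the patch):

CREATE TABLE t1 (a int, b int, c int);
CREATE INDEX t1_a_b_c_idx ON t1 (a, b, c);

-- build_uniquekeys() turns the DISTINCT clause into a requested UniqueKey
-- on (a). A path whose own uniquekeys already cover (a), for example a skip
-- scan over the leading index column, can be kept in unique_pathlist via
-- add_unique_path() and then satisfy the DISTINCT without a separate
-- Unique or HashAggregate step.
SELECT DISTINCT a FROM t1;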
Attachment: v2-0002-Index-skip-scan.patch (application/octet-stream)
From 31217a6e401861e25620720132addfddd2a2f5b5 Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH 4/5] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
as part of the IndexOnlyScan, IndexScan and BitmapIndexScan for nbtree.
This patch improves performance of two main types of queries significantly:
- SELECT DISTINCT, SELECT DISTINCT ON
- Regular SELECTs with WHERE-clauses on non-leading index attributes
For example, given an nbtree index on three columns (a,b,c), the following queries
may now be significantly faster:
- SELECT DISTINCT ON (a) * FROM t1
- SELECT * FROM t1 WHERE b=2
- SELECT * FROM t1 WHERE b IN (10,40)
- SELECT DISTINCT ON (a,b) * FROM t1 WHERE c BETWEEN 10 AND 100 ORDER BY a,b,c
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen. Further functionality was
added by Floris van Nee to make the skip implementation more general and performant.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Floris van Nee, Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
contrib/amcheck/verify_nbtree.c | 4 +-
contrib/bloom/blutils.c | 3 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 121 +-
doc/src/sgml/indices.sgml | 28 +
src/backend/access/brin/brin.c | 3 +
src/backend/access/gin/ginutil.c | 3 +
src/backend/access/gist/gist.c | 3 +
src/backend/access/hash/hash.c | 3 +
src/backend/access/index/indexam.c | 163 ++
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/nbtinsert.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtree.c | 58 +-
src/backend/access/nbtree/nbtsearch.c | 790 ++++-----
src/backend/access/nbtree/nbtskip.c | 1455 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtutils.c | 850 +++++++++-
src/backend/access/spgist/spgutils.c | 3 +
src/backend/commands/explain.c | 29 +
src/backend/executor/execScan.c | 37 +-
src/backend/executor/nodeBitmapIndexscan.c | 22 +-
src/backend/executor/nodeIndexonlyscan.c | 69 +-
src/backend/executor/nodeIndexscan.c | 72 +-
src/backend/nodes/copyfuncs.c | 5 +
src/backend/nodes/outfuncs.c | 6 +
src/backend/nodes/readfuncs.c | 5 +
src/backend/optimizer/path/allpaths.c | 54 +-
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/indxpath.c | 68 +
src/backend/optimizer/path/pathkeys.c | 72 +
src/backend/optimizer/plan/createplan.c | 38 +-
src/backend/optimizer/plan/planner.c | 16 +-
src/backend/optimizer/util/pathnode.c | 78 +
src/backend/optimizer/util/plancat.c | 3 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/amapi.h | 19 +
src/include/access/genam.h | 16 +
src/include/access/nbtree.h | 143 +-
src/include/executor/executor.h | 4 +
src/include/nodes/execnodes.h | 7 +
src/include/nodes/pathnodes.h | 6 +
src/include/nodes/plannodes.h | 5 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 4 +
src/include/optimizer/paths.h | 4 +
src/interfaces/libpq/encnames.c | 1 +
src/interfaces/libpq/wchar.c | 1 +
src/test/regress/expected/select_distinct.out | 599 +++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/select_distinct.sql | 248 +++
53 files changed, 4613 insertions(+), 546 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtskip.c
create mode 120000 src/interfaces/libpq/encnames.c
create mode 120000 src/interfaces/libpq/wchar.c
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 42a830c33b..2849e4bc72 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2652,7 +2652,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
Buffer lbuf;
bool exists;
- key = _bt_mkscankey(state->rel, itup);
+ key = _bt_mkscankey(state->rel, itup, NULL);
Assert(key->heapkeyspace && key->scantid != NULL);
/*
@@ -3108,7 +3108,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
- skey = _bt_mkscankey(rel, itup);
+ skey = _bt_mkscankey(rel, itup, NULL);
skey->pivotsearch = true;
return skey;
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 754de008d4..d35edd1d09 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -134,6 +134,9 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0bcc6fd322..5c377fea08 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5007,6 +5007,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index cf359fa9ff..1ced71a45f 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -151,6 +151,9 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
+ ambeginscan_skip_function ambeginskipscan; /* can be NULL */
+ amgettuple_with_skip_function amgetskiptuple; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -751,6 +754,122 @@ amrestrpos (IndexScanDesc scan);
struct may be set to NULL.
</para>
+ <para>
+<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefixDir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the prefix part of the tuple is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>postfixDir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the postfix (everything after the prefix) of the tuple is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ </para>
+ <para>
+<programlisting>
+IndexScanDesc
+ambeginscan_skip (Relation indexRelation,
+ int nkeys,
+ int norderbys,
+ int prefix);
+</programlisting>
+ Prepare for an index scan. The <literal>nkeys</literal> and <literal>norderbys</literal>
+ parameters indicate the number of quals and ordering operators that will be
+ used in the scan; these may be useful for space allocation purposes.
+ Note that the actual values of the scan keys aren't provided yet.
+ The result must be a palloc'd struct.
+ For implementation reasons the index access method
+ <emphasis>must</emphasis> create this struct by calling
+ <function>RelationGetIndexScan()</function>. In most cases
+ <function>ambeginscan</function> does little beyond making that call and perhaps
+ acquiring locks;
+ the interesting parts of index-scan startup are in <function>amrescan</function>.
+ If this is a skip scan, prefix must indicate the length of the prefix that can be
+ skipped over. Prefix can be set to -1 to disable skipping, which will yield an
+ identical scan to a regular call to <function>ambeginscan</function>.
+ </para>
+ <para>
+ <programlisting>
+ boolean
+ amgettuple_skip (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+ </programlisting>
+ Fetch the next tuple in the given scan, moving in the given
+ directions. The prefix direction applies to the prefix whose length was
+ specified in the <function>ambeginscan_skip</function> call; the postfix
+ direction applies to the remaining columns. These directions can differ in
+ DISTINCT scans when fetching backwards from a cursor.
+ Returns true if a tuple was
+ obtained, false if no matching tuples remain. In the true case the tuple
+ TID is stored into the <literal>scan</literal> structure. Note that
+ <quote>success</quote> means only that the index contains an entry that matches
+ the scan keys, not that the tuple necessarily still exists in the heap or
+ will pass the caller's snapshot test. On success, <function>amgettuple_skip</function>
+ must also set <literal>scan->xs_recheck</literal> to true or false.
+ False means it is certain that the index entry matches the scan keys.
+ true means this is not certain, and the conditions represented by the
+ scan keys must be rechecked against the heap tuple after fetching it.
+ This provision supports <quote>lossy</quote> index operators.
+ Note that rechecking will extend only to the scan conditions; a partial
+ index predicate (if any) is never rechecked by <function>amgettuple</function>
+ callers.
+ </para>
+
+ <para>
+ If the index supports <link linkend="indexes-index-only-scans">index-only
+ scans</link> (i.e., <function>amcanreturn</function> returns true for it),
+ then on success the AM must also check <literal>scan->xs_want_itup</literal>,
+ and if that is true it must return the originally indexed data for the
+ index entry. The data can be returned in the form of an
+ <structname>IndexTuple</structname> pointer stored at <literal>scan->xs_itup</literal>,
+ with tuple descriptor <literal>scan->xs_itupdesc</literal>; or in the form of
+ a <structname>HeapTuple</structname> pointer stored at <literal>scan->xs_hitup</literal>,
+ with tuple descriptor <literal>scan->xs_hitupdesc</literal>. (The latter
+ format should be used when reconstructing data that might possibly not fit
+ into an <structname>IndexTuple</structname>.) In either case,
+ management of the data referenced by the pointer is the access method's
+ responsibility. The data must remain good at least until the next
+ <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
+ call for the scan.
+ </para>
+
+ <para>
+ The <function>amgettuple_skip</function> function need only be provided if the access
+ method supports skip scans. If it doesn't, the
+ <structfield>amgetskiptuple</structfield> field in its <structname>IndexAmRoutine</structname>
+ struct must be set to NULL.
+ </para>
+
<para>
In addition to supporting ordinary index scans, some types of index
may wish to support <firstterm>parallel index scans</firstterm>, which allow
@@ -766,7 +885,7 @@ amrestrpos (IndexScanDesc scan);
functions may be implemented to support parallel index scans:
</para>
- <para>
+ <para>
<programlisting>
Size
amestimateparallelscan (void);
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 56fbd45178..38d0bfa4d9 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1297,6 +1297,34 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When the rows retrieved from an index scan are then deduplicated by
+ eliminating rows matching on a prefix of index keys (e.g. when using
+ <literal>SELECT DISTINCT</literal>), the planner will consider
+ skipping groups of rows with a matching key prefix. When a row with
+ a particular prefix is found, remaining rows with the same key prefix
+ are skipped. The larger the number of rows with the same key prefix
+ (i.e. the lower the number of distinct key prefixes in the index),
+ the more efficient this is.
+ </para>
+ <para>
+ Additionally, a skip scan can be considered in regular <literal>SELECT</literal>
+ queries. When filtering on a non-leading attribute of an index, the planner
+ may choose a skip scan.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccc9fa0959..7efce94edb 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -118,6 +118,9 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6d2d71be32..ed5d5040ee 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -66,6 +66,9 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 0683f42c25..3748dd30b6 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -87,6 +87,9 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index eb3810494f..eaf5431a72 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -84,6 +84,9 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 5e22479b7a..bd54e2ff64 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -14,7 +14,9 @@
* index_open - open an index relation by relation OID
* index_close - close an index relation
* index_beginscan - start a scan of an index with amgettuple
+ * index_beginscan_skip - start a scan of an index with amgettuple and skipping
* index_beginscan_bitmap - start a scan of an index with amgetbitmap
+ * index_beginscan_bitmap_skip - start a skip scan of an index with amgetbitmap
* index_rescan - restart a scan of an index
* index_endscan - end a scan
* index_insert - insert an index tuple into a relation
@@ -25,14 +27,17 @@
* index_parallelrescan - (re)start a parallel scan of an index
* index_beginscan_parallel - join parallel index scan
* index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid_skip - get the next TID from a skip scan
* index_fetch_heap - get the scan's next heap tuple
* index_getnext_slot - get the next tuple from a scan
+ * index_getnext_slot_skip - get the next tuple from a skip scan
* index_getbitmap - get all tuples from a scan
* index_bulk_delete - bulk deletion of index tuples
* index_vacuum_cleanup - post-deletion cleanup of an index
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -224,6 +229,78 @@ index_beginscan(Relation heapRelation,
return scan;
}
+static IndexScanDesc
+index_beginscan_internal_skip(Relation indexRelation,
+ int nkeys, int norderbys, int prefix, Snapshot snapshot,
+ ParallelIndexScanDesc pscan, bool temp_snap)
+{
+ IndexScanDesc scan;
+
+ RELATION_CHECKS;
+ CHECK_REL_PROCEDURE(ambeginskipscan);
+
+ if (!(indexRelation->rd_indam->ampredlocks))
+ PredicateLockRelation(indexRelation, snapshot);
+
+ /*
+ * We hold a reference count to the relcache entry throughout the scan.
+ */
+ RelationIncrementReferenceCount(indexRelation);
+
+ /*
+ * Tell the AM to open a scan.
+ */
+ scan = indexRelation->rd_indam->ambeginskipscan(indexRelation, nkeys,
+ norderbys, prefix);
+ /* Initialize information for parallel scan. */
+ scan->parallel_scan = pscan;
+ scan->xs_temp_snap = temp_snap;
+
+ return scan;
+}
+
+IndexScanDesc
+index_beginscan_skip(Relation heapRelation,
+ Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys, int norderbys, int prefix)
+{
+ IndexScanDesc scan;
+
+ scan = index_beginscan_internal_skip(indexRelation, nkeys, norderbys, prefix, snapshot, NULL, false);
+
+ /*
+ * Save additional parameters into the scandesc. Everything else was set
+ * up by RelationGetIndexScan.
+ */
+ scan->heapRelation = heapRelation;
+ scan->xs_snapshot = snapshot;
+
+ /* prepare to fetch index matches from table */
+ scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+
+ return scan;
+}
+
+IndexScanDesc
+index_beginscan_bitmap_skip(Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys,
+ int prefix)
+{
+ IndexScanDesc scan;
+
+ scan = index_beginscan_internal_skip(indexRelation, nkeys, 0, prefix, snapshot, NULL, false);
+
+ /*
+ * Save additional parameters into the scandesc. Everything else was set
+ * up by RelationGetIndexScan.
+ */
+ scan->xs_snapshot = snapshot;
+
+ return scan;
+}
+
/*
* index_beginscan_bitmap - start a scan of an index with amgetbitmap
*
@@ -553,6 +630,45 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
return &scan->xs_heaptid;
}
+ItemPointer
+index_getnext_tid_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ bool found;
+
+ SCAN_CHECKS;
+ CHECK_SCAN_PROCEDURE(amgetskiptuple);
+
+ Assert(TransactionIdIsValid(RecentXmin));
+
+ /*
+ * The AM's amgetskiptuple proc finds the next index entry matching the scan
+ * keys, and puts the TID into scan->xs_heaptid. It should also set
+ * scan->xs_recheck and possibly scan->xs_itup/scan->xs_hitup, though we
+ * pay no attention to those fields here.
+ */
+ found = scan->indexRelation->rd_indam->amgetskiptuple(scan, prefixDir, postfixDir);
+
+ /* Reset kill flag immediately for safety */
+ scan->kill_prior_tuple = false;
+ scan->xs_heap_continue = false;
+
+ /* If we're out of index entries, we're done */
+ if (!found)
+ {
+ /* release resources (like buffer pins) from table accesses */
+ if (scan->xs_heapfetch)
+ table_index_fetch_reset(scan->xs_heapfetch);
+
+ return NULL;
+ }
+ Assert(ItemPointerIsValid(&scan->xs_heaptid));
+
+ pgstat_count_index_tuples(scan->indexRelation, 1);
+
+ /* Return the TID of the tuple we found. */
+ return &scan->xs_heaptid;
+}
+
/* ----------------
* index_fetch_heap - get the scan's next heap tuple
*
@@ -644,6 +760,38 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
return false;
}
+bool
+index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir, TupleTableSlot *slot)
+{
+ for (;;)
+ {
+ if (!scan->xs_heap_continue)
+ {
+ ItemPointer tid;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid_skip(scan, prefixDir, postfixDir);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ break;
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+ }
+
+ /*
+ * Fetch the next (or only) visible heap tuple for this index entry.
+ * If we don't find anything, loop around and grab the next TID from
+ * the index.
+ */
+ Assert(ItemPointerIsValid(&scan->xs_heaptid));
+ if (index_fetch_heap(scan, slot))
+ return true;
+ }
+
+ return false;
+}
+
/* ----------------
* index_getbitmap - get all tuples at once from an index scan
*
@@ -739,6 +887,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, prefixDir, postfixDir);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index d69808e78c..da96ac00a6 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -19,6 +19,7 @@ OBJS = \
nbtpage.o \
nbtree.o \
nbtsearch.o \
+ nbtskip.o \
nbtsort.o \
nbtsplitloc.o \
nbtutils.o \
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 7355e1dba1..483818e1e1 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -107,7 +107,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
/* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
+ itup_key = _bt_mkscankey(rel, itup, NULL);
if (checkingunique)
{
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 5bc7c3616a..361417c685 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1968,7 +1968,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = _bt_mkscankey(rel, targetkey, NULL);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 40ad0956e0..0afaa3e34c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -123,6 +123,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -130,8 +131,10 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amvalidate = btvalidate;
amroutine->amadjustmembers = btadjustmembers;
amroutine->ambeginscan = btbeginscan;
+ amroutine->ambeginskipscan = btbeginscan_skip;
amroutine->amrescan = btrescan;
amroutine->amgettuple = btgettuple;
+ amroutine->amgetskiptuple = btgettuple_skip;
amroutine->amgetbitmap = btgetbitmap;
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
@@ -208,6 +211,15 @@ btinsert(Relation rel, Datum *values, bool *isnull,
*/
bool
btgettuple(IndexScanDesc scan, ScanDirection dir)
+{
+ return btgettuple_skip(scan, dir, dir);
+}
+
+/*
+ * btgettuple_skip() -- Get the next tuple in the scan.
+ */
+bool
+btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
@@ -226,7 +238,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
if (so->numArrayKeys < 0)
return false;
- _bt_start_array_keys(scan, dir);
+ _bt_start_array_keys(scan, prefixDir);
}
/* This loop handles advancing to the next array elements, if any */
@@ -238,7 +250,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
* _bt_first() to get the first item in the scan.
*/
if (!BTScanPosIsValid(so->currPos))
- res = _bt_first(scan, dir);
+ res = _bt_first(scan, prefixDir, postfixDir);
else
{
/*
@@ -265,14 +277,14 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* Now continue the scan.
*/
- res = _bt_next(scan, dir);
+ res = _bt_next(scan, prefixDir, postfixDir);
}
/* If we have a tuple, return it ... */
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ } while (so->numArrayKeys && _bt_advance_array_keys(scan, prefixDir));
return res;
}
@@ -303,7 +315,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
do
{
/* Fetch the first page & tuple */
- if (_bt_first(scan, ForwardScanDirection))
+ if (_bt_first(scan, ForwardScanDirection, ForwardScanDirection))
{
/* Save tuple ID, and continue scanning */
heapTid = &scan->xs_heaptid;
@@ -319,7 +331,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
if (++so->currPos.itemIndex > so->currPos.lastItem)
{
/* let _bt_next do the heavy lifting */
- if (!_bt_next(scan, ForwardScanDirection))
+ if (!_bt_next(scan, ForwardScanDirection, ForwardScanDirection))
break;
}
@@ -340,6 +352,16 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
*/
IndexScanDesc
btbeginscan(Relation rel, int nkeys, int norderbys)
+{
+ return btbeginscan_skip(rel, nkeys, norderbys, -1);
+}
+
+
+/*
+ * btbeginscan_skip() -- start a (possibly skip-enabled) scan on a btree index
+ */
+IndexScanDesc
+btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix)
{
IndexScanDesc scan;
BTScanOpaque so;
@@ -374,10 +396,20 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipData = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
+ if (skipPrefix > 0)
+ {
+ so->skipData = (BTSkip) palloc0(sizeof(BTSkipData));
+ so->skipData->prefix = skipPrefix;
+
+ elog(DEBUG1, "skip prefix: %d", skipPrefix);
+ }
+
return scan;
}
@@ -440,6 +472,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ return _bt_skip(scan, prefixDir, postfixDir);
+}
+
/*
* btendscan() -- close down a scan
*/
@@ -473,6 +514,8 @@ btendscan(IndexScanDesc scan)
if (so->currTuples != NULL)
pfree(so->currTuples);
/* so->markTuples should not be pfree'd, see btrescan */
+ if (_bt_skip_enabled(so))
+ pfree(so->skipData);
pfree(so);
}
@@ -556,6 +599,9 @@ btrestrpos(IndexScanDesc scan)
if (so->currTuples)
memcpy(so->currTuples, so->markTuples,
so->markPos.nextTupleOffset);
+ if (so->skipData)
+ memcpy(&so->skipData->curPos, &so->skipData->markPos,
+ sizeof(BTSkipPosData));
}
else
BTScanPosInvalidate(so->currPos);
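
A quick map of how the new nbtree entry points relate to the existing ones may
help while reading the nbtsearch.c changes that follow; this only restates the
wrappers from the hunks above, it is not new behavior:

/*
 * btbeginscan(rel, nkeys, norderbys)
 *     -> btbeginscan_skip(rel, nkeys, norderbys, -1)   no skipData allocated
 *
 * btgettuple(scan, dir)
 *     -> btgettuple_skip(scan, dir, dir)               "regular mode"
 *
 * A skip scan instead opens with skipPrefix > 0 (for example 1, when only
 * distinct values of the first column of an index on (a, b) are needed), and
 * the prefix/postfix directions can differ once a cursor reverses its fetch
 * direction.
 */
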
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d1177d8772..faff9d8652 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -17,19 +17,17 @@
#include "access/nbtree.h"
#include "access/relscan.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/predicate.h"
+#include "utils/guc.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -38,14 +36,12 @@ static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline bool _bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool isRegularMode,
+ bool *continuescan, int *prefixskipindex);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -61,7 +57,7 @@ static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
* will remain in shared memory for as long as it takes to scan the index
* buffer page.
*/
-static void
+void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
_bt_unlockbuf(scan->indexRelation, sp->buf);
@@ -339,7 +335,7 @@ _bt_moveright(Relation rel,
* the given page. _bt_binsrch() has no lock or refcount side effects
* on the buffer.
*/
-static OffsetNumber
+OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
Buffer buf)
@@ -845,25 +841,23 @@ _bt_compare(Relation rel,
* in locating the scan start position.
*/
bool
-_bt_first(IndexScanDesc scan, ScanDirection dir)
+_bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
- bool nextkey;
bool goback;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysCount = 0;
- int i;
bool status;
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ IndexTuple itup;
Assert(!BTScanPosIsValid(so->currPos));
@@ -904,184 +898,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ if (!_bt_parallel_readpage(scan, blkno, prefixDir))
return false;
goto readcomplete;
}
}
- /*----------
- * Examine the scan keys to discover where we need to start the scan.
- *
- * We want to identify the keys that can be used as starting boundaries;
- * these are =, >, or >= keys for a forward scan or =, <, <= keys for
- * a backwards scan. We can use keys for multiple attributes so long as
- * the prior attributes had only =, >= (resp. =, <=) keys. Once we accept
- * a > or < boundary or find an attribute with no boundary (which can be
- * thought of as the same as "> -infinity"), we can't use keys for any
- * attributes to its right, because it would break our simplistic notion
- * of what initial positioning strategy to use.
- *
- * When the scan keys include cross-type operators, _bt_preprocess_keys
- * may not be able to eliminate redundant keys; in such cases we will
- * arbitrarily pick a usable one for each attribute. This is correct
- * but possibly not optimal behavior. (For example, with keys like
- * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
- * x=5 would be more efficient.) Since the situation only arises given
- * a poorly-worded query plus an incomplete opfamily, live with it.
- *
- * When both equality and inequality keys appear for a single attribute
- * (again, only possible when cross-type operators appear), we *must*
- * select one of the equality keys for the starting point, because
- * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
- * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
- * start at x=4, we will fail and stop before reaching x=10. If multiple
- * equality quals survive preprocessing, however, it doesn't matter which
- * one we use --- by definition, they are either redundant or
- * contradictory.
- *
- * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
- * If the index stores nulls at the end of the index we'll be starting
- * from, and we have no boundary key for the column (which means the key
- * we deduced NOT NULL from is an inequality key that constrains the other
- * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
- * use as a boundary key. If we didn't do this, we might find ourselves
- * traversing a lot of null entries at the start of the scan.
- *
- * In this loop, row-comparison keys are treated the same as keys on their
- * first (leftmost) columns. We'll add on lower-order columns of the row
- * comparison below, if possible.
- *
- * The selected scan keys (at most one per index column) are remembered by
- * storing their addresses into the local startKeys[] array.
- *----------
- */
- strat_total = BTEqualStrategyNumber;
- if (so->numberOfKeys > 0)
- {
- AttrNumber curattr;
- ScanKey chosen;
- ScanKey impliesNN;
- ScanKey cur;
-
- /*
- * chosen is the so-far-chosen key for the current attribute, if any.
- * We don't cast the decision in stone until we reach keys for the
- * next attribute.
- */
- curattr = 1;
- chosen = NULL;
- /* Also remember any scankey that implies a NOT NULL constraint */
- impliesNN = NULL;
-
- /*
- * Loop iterates from 0 to numberOfKeys inclusive; we use the last
- * pass to handle after-last-key processing. Actual exit from the
- * loop is at one of the "break" statements below.
- */
- for (cur = so->keyData, i = 0;; cur++, i++)
- {
- if (i >= so->numberOfKeys || cur->sk_attno != curattr)
- {
- /*
- * Done looking at keys for curattr. If we didn't find a
- * usable boundary key, see if we can deduce a NOT NULL key.
- */
- if (chosen == NULL && impliesNN != NULL &&
- ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
- ScanDirectionIsForward(dir) :
- ScanDirectionIsBackward(dir)))
- {
- /* Yes, so build the key in notnullkeys[keysCount] */
- chosen = &notnullkeys[keysCount];
- ScanKeyEntryInitialize(chosen,
- (SK_SEARCHNOTNULL | SK_ISNULL |
- (impliesNN->sk_flags &
- (SK_BT_DESC | SK_BT_NULLS_FIRST))),
- curattr,
- ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
- BTGreaterStrategyNumber :
- BTLessStrategyNumber),
- InvalidOid,
- InvalidOid,
- InvalidOid,
- (Datum) 0);
- }
-
- /*
- * If we still didn't find a usable boundary key, quit; else
- * save the boundary key pointer in startKeys.
- */
- if (chosen == NULL)
- break;
- startKeys[keysCount++] = chosen;
-
- /*
- * Adjust strat_total, and quit if we have stored a > or <
- * key.
- */
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
- {
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
- break;
- }
-
- /*
- * Done if that was the last attribute, or if next key is not
- * in sequence (implying no boundary key is available for the
- * next attribute).
- */
- if (i >= so->numberOfKeys ||
- cur->sk_attno != curattr + 1)
- break;
-
- /*
- * Reset for next attr.
- */
- curattr = cur->sk_attno;
- chosen = NULL;
- impliesNN = NULL;
- }
-
- /*
- * Can we use this key as a starting boundary for this attr?
- *
- * If not, does it imply a NOT NULL constraint? (Because
- * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
- * *any* inequality key works for that; we need not test.)
- */
- switch (cur->sk_strategy)
- {
- case BTLessStrategyNumber:
- case BTLessEqualStrategyNumber:
- if (chosen == NULL)
- {
- if (ScanDirectionIsBackward(dir))
- chosen = cur;
- else
- impliesNN = cur;
- }
- break;
- case BTEqualStrategyNumber:
- /* override any non-equality choice */
- chosen = cur;
- break;
- case BTGreaterEqualStrategyNumber:
- case BTGreaterStrategyNumber:
- if (chosen == NULL)
- {
- if (ScanDirectionIsForward(dir))
- chosen = cur;
- else
- impliesNN = cur;
- }
- break;
- }
- }
- }
+ keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, prefixDir, startKeys, notnullkeys, &strat_total, 0);
/*
* If we found no usable boundary keys, we have to start from one end of
@@ -1092,260 +915,112 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
bool match;
- match = _bt_endpoint(scan, dir);
-
- if (!match)
+ if (!_bt_skip_enabled(so))
{
- /* No match, so mark (parallel) scan finished */
- _bt_parallel_done(scan);
- }
+ match = _bt_endpoint(scan, prefixDir);
- return match;
- }
+ if (!match)
+ {
+ /* No match, so mark (parallel) scan finished */
+ _bt_parallel_done(scan);
+ }
- /*
- * We want to start the scan somewhere within the index. Set up an
- * insertion scankey we can use to search for the boundary point we
- * identified above. The insertion scankey is built using the keys
- * identified by startKeys[]. (Remaining insertion scankey fields are
- * initialized after initial-positioning strategy is finalized.)
- */
- Assert(keysCount <= INDEX_MAX_KEYS);
- for (i = 0; i < keysCount; i++)
- {
- ScanKey cur = startKeys[i];
+ return match;
+ }
+ else
+ {
+ Relation rel = scan->indexRelation;
+ Buffer buf;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber start;
+ BTSkipCompareResult cmp = {0};
- Assert(cur->sk_attno == i + 1);
+ _bt_skip_create_scankeys(rel, so);
- if (cur->sk_flags & SK_ROW_HEADER)
- {
/*
- * Row comparison header: look to the first row member instead.
- *
- * The member scankeys are already in insertion format (ie, they
- * have sk_func = 3-way-comparison function), but we have to watch
- * out for nulls, which _bt_preprocess_keys didn't check. A null
- * in the first row member makes the condition unmatchable, just
- * like qual_ok = false.
+ * Scan down to the leftmost or rightmost leaf page and position
+ * the scan on the leftmost or rightmost item on that page.
+ * Start the skip scan from there to find the first matching item
*/
- ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+ buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(prefixDir), scan->xs_snapshot);
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_flags & SK_ISNULL)
+ if (!BufferIsValid(buf))
{
- _bt_parallel_done(scan);
+ /*
+ * Empty index. Lock the whole relation, as nothing finer to lock
+ * exists.
+ */
+ PredicateLockRelation(rel, scan->xs_snapshot);
+ BTScanPosInvalidate(so->currPos);
return false;
}
- memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
- /*
- * If the row comparison is the last positioning key we accepted,
- * try to add additional keys from the lower-order row members.
- * (If we accepted independent conditions on additional index
- * columns, we use those instead --- doesn't seem worth trying to
- * determine which is more restrictive.) Note that this is OK
- * even if the row comparison is of ">" or "<" type, because the
- * condition applied to all but the last row member is effectively
- * ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
- * did use has to be treated as just a ">=" or "<=" condition, and
- * so we'd better adjust strat_total accordingly.
- */
- if (i == keysCount - 1)
+ PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISLEAF(opaque));
+
+ if (ScanDirectionIsForward(prefixDir))
{
- bool used_all_subkeys = false;
+ /* There could be dead pages to the left, so not this: */
+ /* Assert(P_LEFTMOST(opaque)); */
- Assert(!(subkey->sk_flags & SK_ROW_END));
- for (;;)
- {
- subkey++;
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno != keysCount + 1)
- break; /* out-of-sequence, can't use it */
- if (subkey->sk_strategy != cur->sk_strategy)
- break; /* wrong direction, can't use it */
- if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
- Assert(keysCount < INDEX_MAX_KEYS);
- memcpy(inskey.scankeys + keysCount, subkey,
- sizeof(ScanKeyData));
- keysCount++;
- if (subkey->sk_flags & SK_ROW_END)
- {
- used_all_subkeys = true;
- break;
- }
- }
- if (!used_all_subkeys)
- {
- switch (strat_total)
- {
- case BTLessStrategyNumber:
- strat_total = BTLessEqualStrategyNumber;
- break;
- case BTGreaterStrategyNumber:
- strat_total = BTGreaterEqualStrategyNumber;
- break;
- }
- }
- break; /* done with outer loop */
+ start = P_FIRSTDATAKEY(opaque);
}
- }
- else
- {
- /*
- * Ordinary comparison key. Transform the search-style scan key
- * to an insertion scan key by replacing the sk_func with the
- * appropriate btree comparison function.
- *
- * If scankey operator is not a cross-type comparison, we can use
- * the cached comparison function; otherwise gotta look it up in
- * the catalogs. (That can't lead to infinite recursion, since no
- * indexscan initiated by syscache lookup will use cross-data-type
- * operators.)
- *
- * We support the convention that sk_subtype == InvalidOid means
- * the opclass input type; this is a hack to simplify life for
- * ScanKeyInit().
- */
- if (cur->sk_subtype == rel->rd_opcintype[i] ||
- cur->sk_subtype == InvalidOid)
+ else if (ScanDirectionIsBackward(prefixDir))
{
- FmgrInfo *procinfo;
-
- procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
- ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
- cur->sk_flags,
- cur->sk_attno,
- InvalidStrategy,
- cur->sk_subtype,
- cur->sk_collation,
- procinfo,
- cur->sk_argument);
+ Assert(P_RIGHTMOST(opaque));
+
+ start = PageGetMaxOffsetNumber(page);
}
else
{
- RegProcedure cmp_proc;
-
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
- rel->rd_opcintype[i],
- cur->sk_subtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
- BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
- cur->sk_attno, RelationGetRelationName(rel));
- ScanKeyEntryInitialize(inskey.scankeys + i,
- cur->sk_flags,
- cur->sk_attno,
- InvalidStrategy,
- cur->sk_subtype,
- cur->sk_collation,
- cmp_proc,
- cur->sk_argument);
+ elog(ERROR, "invalid scan direction: %d", (int) prefixDir);
}
- }
- }
- /*----------
- * Examine the selected initial-positioning strategy to determine exactly
- * where we need to start the scan, and set flag variables to control the
- * code below.
- *
- * If nextkey = false, _bt_search and _bt_binsrch will locate the first
- * item >= scan key. If nextkey = true, they will locate the first
- * item > scan key.
- *
- * If goback = true, we will then step back one item, while if
- * goback = false, we will start the scan on the located item.
- *----------
- */
- switch (strat_total)
- {
- case BTLessStrategyNumber:
-
- /*
- * Find first item >= scankey, then back up one to arrive at last
- * item < scankey. (Note: this positioning strategy is only used
- * for a backward scan, so that is always the correct starting
- * position.)
- */
- nextkey = false;
- goback = true;
- break;
-
- case BTLessEqualStrategyNumber:
-
- /*
- * Find first item > scankey, then back up one to arrive at last
- * item <= scankey. (Note: this positioning strategy is only used
- * for a backward scan, so that is always the correct starting
- * position.)
- */
- nextkey = true;
- goback = true;
- break;
-
- case BTEqualStrategyNumber:
-
- /*
- * If a backward scan was specified, need to start with last equal
- * item not first one.
+ /* remember which buffer we have pinned */
+ so->currPos.buf = buf;
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ itup = _bt_get_tuple_from_offset(so, start);
+ /* In some cases we can (or must) skip further within the prefix.
+ * We can do so when extra quals become available, e.g.
+ * WHERE b=2 on an index on (a,b).
+ * We must do so when this is not regular mode (prefixDir != postfixDir),
+ * because then we are positioned at the end of the prefix while we
+ * should be at its beginning.
*/
- if (ScanDirectionIsBackward(dir))
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, 0) ||
+ !_bt_skip_is_regular_mode(prefixDir, postfixDir))
{
- /*
- * This is the same as the <= strategy. We will check at the
- * end whether the found item is actually =.
- */
- nextkey = true;
- goback = true;
+ _bt_skip_extra_conditions(scan, &itup, &start, prefixDir, postfixDir, &cmp);
}
- else
+ /* now find the next matching tuple */
+ match = _bt_skip_find_next(scan, itup, start, prefixDir, postfixDir);
+ if (!match)
{
- /*
- * This is the same as the >= strategy. We will check at the
- * end whether the found item is actually =.
- */
- nextkey = false;
- goback = false;
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return false;
}
- break;
- case BTGreaterEqualStrategyNumber:
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- /*
- * Find first item >= scankey. (This is only used for forward
- * scans.)
- */
- nextkey = false;
- goback = false;
- break;
-
- case BTGreaterStrategyNumber:
-
- /*
- * Find first item > scankey. (This is only used for forward
- * scans.)
- */
- nextkey = true;
- goback = false;
- break;
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
- default:
- /* can't get here, but keep compiler quiet */
- elog(ERROR, "unrecognized strat_total: %d", (int) strat_total);
- return false;
+ return true;
+ }
}
- /* Initialize remaining insertion scan key fields */
- _bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
- inskey.anynullkeys = false; /* unused */
- inskey.nextkey = nextkey;
- inskey.pivotsearch = false;
- inskey.scantid = NULL;
- inskey.keysz = keysCount;
+ if (!_bt_create_insertion_scan_key(rel, prefixDir, startKeys, keysCount, &inskey, &strat_total, &goback))
+ {
+ _bt_parallel_done(scan);
+ return false;
+ }
/*
* Use the manufactured insertion scan key to descend the tree and
@@ -1377,7 +1052,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(buf),
scan->xs_snapshot);
- _bt_initialize_more_data(so, dir);
+ _bt_initialize_more_data(so, prefixDir);
/* position to the precise item on the page */
offnum = _bt_binsrch(rel, &inskey, buf);
@@ -1407,23 +1082,81 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
so->currPos.buf = buf;
- /*
- * Now load data from the first page of the scan.
- */
- if (!_bt_readpage(scan, dir, offnum))
+ if (_bt_skip_enabled(so))
{
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ bool match;
+ BTSkipCompareResult cmp = {0};
+
+ /* first create the skip scan keys */
+ _bt_skip_create_scankeys(rel, so);
+
+ /* remember which page we have pinned */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ minoff = P_FIRSTDATAKEY(opaque);
+ /* _bt_binsrch plus the goback adjustment can leave offnum before the first
+ * item on the page or after the last item on the page. If that is the case,
+ * we need to step back or forward one page.
*/
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
+ if (offnum < minoff)
+ {
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_back_page(scan, &itup, &offnum))
+ return false;
+ page = BufferGetPage(so->currPos.buf);
+ }
+ else if (offnum > PageGetMaxOffsetNumber(page))
+ {
+ BlockNumber next = opaque->btpo_next;
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_forward_page(scan, next, &itup, &offnum))
+ return false;
+ page = BufferGetPage(so->currPos.buf);
+ }
+
+ itup = _bt_get_tuple_from_offset(so, offnum);
+ /* check if we can skip even more because we can use new conditions */
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, inskey.keysz) ||
+ !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ {
+ _bt_skip_extra_conditions(scan, &itup, &offnum, prefixDir, postfixDir, &cmp);
+ }
+ /* now find the tuple */
+ match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+ if (!match)
+ {
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
return false;
+ }
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
}
else
{
- /* Drop the lock, and maybe the pin, on the current page */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ /*
+ * Now load data from the first page of the scan.
+ */
+ if (!_bt_readpage(scan, prefixDir, &offnum, true))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (!_bt_steppage(scan, prefixDir))
+ return false;
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
}
readcomplete:
@@ -1451,29 +1184,113 @@ readcomplete:
* so->currPos.buf to InvalidBuffer.
*/
bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
+_bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
- /*
- * Advance to next tuple on current page; or if there's no more, try to
- * step to the next page with data.
- */
- if (ScanDirectionIsForward(dir))
+ if (!_bt_skip_enabled(so))
{
- if (++so->currPos.itemIndex > so->currPos.lastItem)
+ /*
+ * Advance to next tuple on current page; or if there's no more, try to
+ * step to the next page with data.
+ */
+ if (ScanDirectionIsForward(prefixDir))
{
- if (!_bt_steppage(scan, dir))
- return false;
+ if (++so->currPos.itemIndex > so->currPos.lastItem)
+ {
+ if (!_bt_steppage(scan, prefixDir))
+ return false;
+ }
+ }
+ else
+ {
+ if (--so->currPos.itemIndex < so->currPos.firstItem)
+ {
+ if (!_bt_steppage(scan, prefixDir))
+ return false;
+ }
}
}
else
{
- if (--so->currPos.itemIndex < so->currPos.firstItem)
+ bool match;
+ IndexTuple itup = NULL;
+ OffsetNumber offnum = InvalidOffsetNumber;
+
+ if (ScanDirectionIsForward(postfixDir))
{
- if (!_bt_steppage(scan, dir))
- return false;
+ if (++so->currPos.itemIndex > so->currPos.lastItem)
+ {
+ if (prefixDir != so->skipData->curPos.nextDirection)
+ {
+ /* This happens when doing a cursor scan that changes
+ * direction in the meantime, e.g. a first fetch forwards,
+ * then a fetch backwards.
+ * In that case we *always* just go to the next page instead
+ * of skipping, because that's the only safe option.
+ */
+ so->skipData->curPos.nextAction = SkipStateNext;
+ so->skipData->curPos.nextDirection = prefixDir;
+ }
+
+ if (so->skipData->curPos.nextAction == SkipStateNext)
+ {
+ /* we should just go forwards one page, no skipping is necessary */
+ if (!_bt_step_forward_page(scan, so->currPos.nextPage, &itup, &offnum))
+ return false;
+ }
+ else if (so->skipData->curPos.nextAction == SkipStateStop)
+ {
+ /* we've reached the end of the index, or we cannot find any more keys */
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /* now find the next tuple */
+ match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+ if (!match)
+ {
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return false;
+ }
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+ }
+ else
+ {
+ if (--so->currPos.itemIndex < so->currPos.firstItem)
+ {
+ if (prefixDir != so->skipData->curPos.nextDirection)
+ {
+ so->skipData->curPos.nextAction = SkipStateNext;
+ so->skipData->curPos.nextDirection = prefixDir;
+ }
+
+ if (so->skipData->curPos.nextAction == SkipStateNext)
+ {
+ if (!_bt_step_back_page(scan, &itup, &offnum))
+ return false;
+ }
+ else if (so->skipData->curPos.nextAction == SkipStateStop)
+ {
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /* now find the next tuple */
+ match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+ if (!match)
+ {
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return false;
+ }
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
}
}
@@ -1505,8 +1322,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
*
* Returns true if any matching items found on the page, false if none.
*/
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber *offnum, bool isRegularMode)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
@@ -1516,6 +1333,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
bool continuescan;
int indnatts;
+ int prefixskipindex;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1574,11 +1392,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* load items[] in ascending order */
itemIndex = 0;
- offnum = Max(offnum, minoff);
+ *offnum = Max(*offnum, minoff);
- while (offnum <= maxoff)
+ while (*offnum <= maxoff)
{
- ItemId iid = PageGetItemId(page, offnum);
+ ItemId iid = PageGetItemId(page, *offnum);
IndexTuple itup;
/*
@@ -1587,19 +1405,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
- offnum = OffsetNumberNext(offnum);
+ *offnum = OffsetNumberNext(*offnum);
continue;
}
itup = (IndexTuple) PageGetItem(page, iid);
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ if (_bt_checkkeys_extended(scan, itup, indnatts, dir, isRegularMode, &continuescan, &prefixskipindex))
{
/* tuple passes all scan key conditions */
if (!BTreeTupleIsPosting(itup))
{
/* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
+ _bt_saveitem(so, itemIndex, *offnum, itup);
itemIndex++;
}
else
@@ -1611,26 +1429,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* TID
*/
tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
+ _bt_setuppostingitems(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, 0),
itup);
itemIndex++;
/* Remember additional TIDs */
for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
{
- _bt_savepostingitem(so, itemIndex, offnum,
+ _bt_savepostingitem(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, i),
tupleOffset);
itemIndex++;
}
}
}
+
+ *offnum = OffsetNumberNext(*offnum);
+
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
break;
-
- offnum = OffsetNumberNext(offnum);
+ if (!isRegularMode && prefixskipindex != -1)
+ break;
}
+ *offnum = OffsetNumberPrev(*offnum);
/*
* We don't need to visit page to the right when the high key
@@ -1650,7 +1472,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int truncatt;
truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, NULL);
}
if (!continuescan)
@@ -1666,11 +1488,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* load items[] in descending order */
itemIndex = MaxTIDsPerBTreePage;
- offnum = Min(offnum, maxoff);
+ *offnum = Min(*offnum, maxoff);
- while (offnum >= minoff)
+ while (*offnum >= minoff)
{
- ItemId iid = PageGetItemId(page, offnum);
+ ItemId iid = PageGetItemId(page, *offnum);
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
@@ -1687,10 +1509,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ Assert(*offnum >= P_FIRSTDATAKEY(opaque));
+ if (*offnum > P_FIRSTDATAKEY(opaque))
{
- offnum = OffsetNumberPrev(offnum);
+ *offnum = OffsetNumberPrev(*offnum);
continue;
}
@@ -1701,8 +1523,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
+ passes_quals = _bt_checkkeys_extended(scan, itup, indnatts, dir,
+ isRegularMode, &continuescan, &prefixskipindex);
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1710,7 +1532,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
{
/* Remember it */
itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ _bt_saveitem(so, itemIndex, *offnum, itup);
}
else
{
@@ -1728,28 +1550,32 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
itemIndex--;
tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
+ _bt_setuppostingitems(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, 0),
itup);
/* Remember additional TIDs */
for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
{
itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
+ _bt_savepostingitem(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, i),
tupleOffset);
}
}
}
+
+ *offnum = OffsetNumberPrev(*offnum);
+
if (!continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
break;
}
-
- offnum = OffsetNumberPrev(offnum);
+ if (!isRegularMode && prefixskipindex != -1)
+ break;
}
+ *offnum = OffsetNumberNext(*offnum);
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
@@ -1857,7 +1683,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
* read lock, on that page. If we do not hold the pin, we set so->currPos.buf
* to InvalidBuffer. We return true to indicate success.
*/
-static bool
+bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1885,6 +1711,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
if (so->markTuples)
memcpy(so->markTuples, so->currTuples,
so->currPos.nextTupleOffset);
+ if (so->skipData)
+ memcpy(&so->skipData->markPos, &so->skipData->curPos,
+ sizeof(BTSkipPosData));
so->markPos.itemIndex = so->markItemIndex;
so->markItemIndex = -1;
}
@@ -1964,7 +1793,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* If there are no more matching records in the given direction, we drop all
* locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
*/
-static bool
+bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1972,6 +1801,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
Page page;
BTPageOpaque opaque;
bool status;
+ OffsetNumber offnum;
rel = scan->indexRelation;
@@ -2002,7 +1832,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ offnum = P_FIRSTDATAKEY(opaque);
+ if (_bt_readpage(scan, dir, &offnum, true))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2104,7 +1935,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ offnum = PageGetMaxOffsetNumber(page);
+ if (_bt_readpage(scan, dir, &offnum, true))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2172,7 +2004,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* to be half-dead; the caller should check that condition and step left
* again if it's important.
*/
-static Buffer
+Buffer
_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
@@ -2436,7 +2268,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!_bt_readpage(scan, dir, &start, true))
{
/*
* There's no actually-matching data on this page. Try to advance to
@@ -2465,7 +2297,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
* for scan direction
*/
-static inline void
+inline void
_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
{
/* initialize moreLeft/moreRight appropriately for scan direction */
@@ -2482,3 +2314,25 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/* Forward the call either to _bt_checkkeys, the simpler and faster way of
+ * checking keys, or to _bt_checkkeys_skip, which is slower but returns
+ * extra information about whether we should stop reading the current page
+ * and skip. The expensive check is only necessary when !isRegularMode,
+ * i.e. when prefixDir != postfixDir, which only happens when scanning
+ * backwards from a cursor.
+ */
+static inline bool
+_bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool isRegularMode,
+ bool *continuescan, int *prefixskipindex)
+{
+ if (isRegularMode)
+ {
+ return _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+ }
+ else
+ {
+ return _bt_checkkeys_skip(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+ }
+}
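
Before the new file itself, a short summary of the skip state machine it uses;
the descriptions below are inferred from how the states are set and consumed in
nbtskip.c and _bt_next() (the enum itself is presumably declared in the
nbtree.h changes, which are not quoted here):

/*
 * SkipStateNext        step to the adjacent tuple/page; no skip is needed
 * SkipStateSkip        re-descend the tree to the next distinct prefix value
 * SkipStateSkipExtra   re-descend further, using quals beyond the prefix
 *                      (e.g. WHERE b = 2 on an index over (a, b))
 * SkipStateStop        end of scan; no further prefixes can match
 */
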
diff --git a/src/backend/access/nbtree/nbtskip.c b/src/backend/access/nbtree/nbtskip.c
new file mode 100644
index 0000000000..e2dbaf2e69
--- /dev/null
+++ b/src/backend/access/nbtree/nbtskip.c
@@ -0,0 +1,1455 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtskip.c
+ * Search code related to skip scan for postgres btrees.
+ *
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtskip.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "catalog/catalog.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+#include "storage/predicate.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+
+static inline void _bt_update_scankey_with_tuple(BTScanInsert scankeys,
+ Relation indexRel, IndexTuple itup, int numattrs);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, Buffer buf);
+static inline int32 _bt_compare_until(Relation rel, BTScanInsert key, IndexTuple itup, int prefix);
+static inline void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum,
+ OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+ ScanDirection postfixDir, int skipped, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction);
+static inline void _bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs);
+static inline IndexTuple _bt_get_tuple_from_offset_with_copy(BTScanOpaque so, OffsetNumber curTupleOffnum);
+
+static void _bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+ ScanDirection prefixDir, ScanDirection postfixDir);
+static void _bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+ int prefix, IndexTuple itup, ScanDirection prefixDir);
+static bool _bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir);
+static void debug_print(IndexTuple itup, BTScanInsert scanKey, Relation rel, char *extra);
+
+/* probably to be removed but useful for debugging during patch implementation */
+static void debug_print(IndexTuple itup, BTScanInsert scanKey, Relation rel, char *extra)
+{
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+ char *lkey_desc = NULL;
+
+ /* Avoid infinite recursion -- don't instrument catalog indexes */
+ if (!IsCatalogRelation(rel))
+ {
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ Oid typOutput;
+ bool varlenatype;
+ char *val;
+ int i;
+
+ char buf[8096] = {0};
+ int idx = 0;
+
+ if (itup != NULL)
+ {
+ natts = BTreeTupleGetNAtts(itup, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, values, isnull);
+
+ rel->rd_index->indnkeyatts = natts;
+
+ /*
+ * Since the regression tests should pass when the instrumentation
+ * patch is applied, be prepared for BuildIndexValueDescription() to
+ * return NULL due to security considerations.
+ */
+ lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ }
+
+ for (i = 0; i < scanKey->keysz; i++)
+ {
+ ScanKey cur = &scanKey->scankeys[i];
+
+ if (i != 0)
+ {
+ buf[idx] = ',';
+ idx++;
+ }
+
+ if (!(cur->sk_flags & SK_ISNULL))
+ {
+ if (cur->sk_subtype != InvalidOid)
+ getTypeOutputInfo(cur->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+ if (val)
+ {
+ unsigned long tocopy = strnlen(val, 15);
+ memcpy(buf + idx, val, tocopy);
+ idx += tocopy;
+ pfree(val);
+ }
+ else
+ {
+ memcpy(buf + idx, "n/a", 3);
+ idx += 3;
+ }
+ }
+ else
+ {
+ memcpy(buf + idx, "null", 4);
+ idx += 4;
+ }
+ }
+ buf[idx] = 0;
+
+ elog(DEBUG1, "%s : %s tuple(%s) sk(%s)",
+ extra, RelationGetRelationName(rel), lkey_desc ? lkey_desc : "N/A", buf);
+
+ /* Cleanup */
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
+ if (lkey_desc)
+ pfree(lkey_desc);
+ }
+}
+
+/*
+ * Returns whether the scan should continue, i.e. whether we have not yet
+ * reached the end of the scan.
+ * The scan position can be invalid even though we should still continue
+ * the scan. This happens, for example, when scanning with
+ * prefixDir != postfixDir: when looking at the first prefix, we traverse
+ * the items within the prefix from max to min. If none of them match, we
+ * run off the start of the index, meaning none of the tuples within this
+ * prefix match. The scan position becomes invalid; however, we still need
+ * to look at the next prefix, so this function returns true in that case.
+ */
+static inline bool
+_bt_skip_is_valid(BTScanOpaque so, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ return BTScanPosIsValid(so->currPos) ||
+ (!_bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+ so->skipData->curPos.nextAction != SkipStateStop);
+}
+
+/* Try to find the next tuple to skip to within the local tuple storage.
+ * The local tuple storage is filled during _bt_readpage with all matching
+ * tuples on that page; if we can find the next prefix there, it saves us
+ * a descent from the root.
+ * Note that this optimization only works in regular mode
+ * (prefixDir == postfixDir). Otherwise the local tuple workspace only
+ * ever contains tuples of one specific prefix (_bt_readpage stops at
+ * the end of a prefix).
+ */
+static bool
+_bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTScanPosItem *currItem;
+ BTSkip skip = so->skipData;
+ IndexTuple itup = NULL;
+ bool goback;
+ int low, high, starthigh, startlow;
+ int32 result,
+ cmpval;
+ BTScanInsert key = &so->skipData->curPos.skipScanKey;
+
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation, skip->prefix, itup, prefixDir);
+
+ _bt_set_bsearch_flags(key->scankeys[key->keysz - 1].sk_strategy, prefixDir, &key->nextkey, &goback);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+
+ startlow = low = ScanDirectionIsForward(prefixDir) ? so->currPos.itemIndex + 1 : so->currPos.firstItem;
+ starthigh = high = ScanDirectionIsForward(prefixDir) ? so->currPos.lastItem : so->currPos.itemIndex - 1;
+
+ /*
+ * If there are no candidate items in the workspace in the chosen
+ * direction, there is nothing to skip to within this page; return
+ * false so the caller falls back to a full skip.
+ */
+ if (unlikely(high < low))
+ return false;
+
+ /*
+ * Binary search to find the first key on the page >= scan key, or first
+ * key > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ int mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ currItem = &so->currPos.items[mid];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ result = _bt_compare_until(scan->indexRelation, key, itup, skip->prefix);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ if (high > starthigh)
+ return false;
+
+ if (goback)
+ {
+ low--;
+ if (low < startlow)
+ return false;
+ }
+
+ so->currPos.itemIndex = low;
+
+ if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+ {
+ currItem = &so->currPos.items[low];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ debug_print(itup, &so->skipData->curPos.skipScanKey, scan->indexRelation, "skip-in-page");
+ }
+
+ return true;
+}
+
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * in: pinned, not locked
+ * out: pinned, not locked (unless end of scan, then unpinned)
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTScanPosItem *currItem;
+ IndexTuple itup = NULL;
+ OffsetNumber curTupleOffnum = InvalidOffsetNumber;
+ BTSkipCompareResult cmp;
+ BTSkip skip = so->skipData;
+ OffsetNumber first;
+
+ /* in page skip only works when prefixDir == postfixDir */
+ if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) || !_bt_try_in_page_skip(scan, prefixDir))
+ {
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ so->skipData->curPos.nextSkipIndex = so->skipData->prefix;
+ _bt_skip_once(scan, &itup, &curTupleOffnum, true, prefixDir, postfixDir);
+ _bt_skip_until_match(scan, &itup, &curTupleOffnum, prefixDir, postfixDir);
+ if (!_bt_skip_is_always_valid(so))
+ return false;
+
+ first = curTupleOffnum;
+ _bt_readpage(scan, postfixDir, &curTupleOffnum, _bt_skip_is_regular_mode(prefixDir, postfixDir));
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ {
+ print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+ "first item on page compared after skip");
+ print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+ "last item on page compared after skip");
+ }
+ _bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+ _bt_determine_next_action(scan, &cmp, first, curTupleOffnum, postfixDir, &skip->curPos.nextAction);
+ skip->curPos.nextDirection = prefixDir;
+ skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+ _bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum), prefixDir, postfixDir);
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* prepare for the call to _bt_next, because _bt_next increments this to get to the tuple we want to be at */
+ if (ScanDirectionIsForward(postfixDir))
+ so->currPos.itemIndex--;
+ else
+ so->currPos.itemIndex++;
+
+ return true;
+}
+
+IndexTuple
+_bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+ Page page = BufferGetPage(so->currPos.buf);
+ return (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+}
+
+static IndexTuple
+_bt_get_tuple_from_offset_with_copy(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+ Page page = BufferGetPage(so->currPos.buf);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+ Size itupsz = IndexTupleSize(itup);
+ memcpy(so->skipData->curPos.skipTuple, itup, itupsz);
+
+ return (IndexTuple) so->skipData->curPos.skipTuple;
+}
+
+static void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum, OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ if (cmp->fullKeySkip)
+ *nextAction = SkipStateStop;
+ else if (ScanDirectionIsForward(postfixDir))
+ {
+ OffsetNumber firstItem = firstOffnum, lastItem = lastOffnum;
+ if (cmp->prefixSkip)
+ {
+ *nextAction = SkipStateSkip;
+ }
+ else
+ {
+ IndexTuple toCmp;
+ if (so->currPos.lastItem >= so->currPos.firstItem)
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, so->currPos.items[so->currPos.lastItem].indexOffset);
+ else
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, firstItem);
+ _bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+ scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+ (cmp->prefixCmpResult != 0 ||
+ _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+ _bt_get_tuple_from_offset(so, lastItem), so->skipData->prefix) != 0))
+ *nextAction = SkipStateSkipExtra;
+ else
+ *nextAction = SkipStateNext;
+ }
+ }
+ else
+ {
+ OffsetNumber firstItem = lastOffnum, lastItem = firstOffnum;
+ if (cmp->prefixSkip)
+ {
+ *nextAction = SkipStateSkip;
+ }
+ else
+ {
+ IndexTuple toCmp;
+ if (so->currPos.lastItem >= so->currPos.firstItem)
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, so->currPos.items[so->currPos.firstItem].indexOffset);
+ else
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, lastItem);
+ _bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+ scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+ (cmp->prefixCmpResult != 0 ||
+ _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+ _bt_get_tuple_from_offset(so, firstItem), so->skipData->prefix) != 0))
+ *nextAction = SkipStateSkipExtra;
+ else
+ *nextAction = SkipStateNext;
+ }
+ }
+}
+
+static inline bool
+_bt_should_prefix_skip(BTSkipCompareResult *cmp)
+{
+ return cmp->prefixSkip || cmp->prefixCmpResult != 0;
+}
+
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+ ScanDirection postfixDir, int skipped, BTSkipState *nextAction)
+{
+ if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+ *nextAction = SkipStateStop;
+ else if (cmp->equal && _bt_skip_is_regular_mode(prefixDir, postfixDir))
+ *nextAction = SkipStateNext;
+ else if (_bt_should_prefix_skip(cmp) && _bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+ ((ScanDirectionIsForward(prefixDir) && cmp->skCmpResult == -1) ||
+ (ScanDirectionIsBackward(prefixDir) && cmp->skCmpResult == 1)))
+ *nextAction = SkipStateSkip;
+ else if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) ||
+ _bt_has_extra_quals_after_skip(so->skipData, postfixDir, skipped) ||
+ cmp->prefixCmpResult != 0)
+ *nextAction = SkipStateSkipExtra;
+ else
+ *nextAction = SkipStateNext;
+}
+
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction)
+{
+ if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+ *nextAction = SkipStateStop;
+ else if (cmp->equal)
+ *nextAction = SkipStateNext;
+ else if (_bt_should_prefix_skip(cmp))
+ *nextAction = SkipStateSkip;
+ else
+ *nextAction = SkipStateNext;
+}
+
+/* just a debug function that prints a scankey. will be removed for final patch */
+static inline void
+_print_skey(IndexScanDesc scan, BTScanInsert scanKey)
+{
+ Oid typOutput;
+ bool varlenatype;
+ char *val;
+ int i;
+ Relation rel = scan->indexRelation;
+
+ for (i = 0; i < scanKey->keysz; i++)
+ {
+ ScanKey cur = &scanKey->scankeys[i];
+ if (!IsCatalogRelation(rel))
+ {
+ if (!(cur->sk_flags & SK_ISNULL))
+ {
+ if (cur->sk_subtype != InvalidOid)
+ getTypeOutputInfo(cur->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+ if (val)
+ {
+ elog(DEBUG1, "%s sk attr %d val: %s (%s, %s)",
+ RelationGetRelationName(rel), i, val,
+ (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ pfree(val);
+ }
+ }
+ else
+ {
+ elog(DEBUG1, "%s sk attr %d val: NULL (%s, %s)",
+ RelationGetRelationName(rel), i,
+ (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ }
+ }
+ }
+}
+
+bool
+_bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixskipindex)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+
+ bool match = _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+ int prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+ if (*prefixskipindex == -1 && prefixCmpResult != 0)
+ {
+ *prefixskipindex = skip->prefix;
+ return false;
+ }
+ else
+ {
+ bool newcont;
+ _bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &newcont, prefixskipindex);
+ if (*prefixskipindex == -1 && prefixCmpResult != 0)
+ {
+ *prefixskipindex = skip->prefix;
+ return false;
+ }
+ }
+ return match;
+}
+
+/*
+ * Compare a scankey with a given tuple but only the first prefix columns
+ * This function returns 0 if the first 'prefix' columns are equal
+ * -1 if key < itup for the first prefix columns
+ * 1 if key > itup for the first prefix columns
+ */
+int32
+_bt_compare_until(Relation rel,
+ BTScanInsert key,
+ IndexTuple itup,
+ int prefix)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey scankey;
+ int ncmpkey;
+
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+
+ ncmpkey = Min(prefix, key->keysz);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+ int32 result;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ /* see comments about NULLs handling in btbuild */
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+ return 0;
+}
+
+
+/*
+ * Create initial scankeys for skipping and stores them in the skipData
+ * structure
+ */
+void
+_bt_skip_create_scankeys(Relation rel, BTScanOpaque so)
+{
+ int keysCount;
+ BTSkip skip = so->skipData;
+ StrategyNumber stratTotal;
+ ScanKey keyPointers[INDEX_MAX_KEYS];
+ bool goback;
+ /*
+ * We need to create both forward and backward keys because the scan
+ * direction may change at any moment in scans with a cursor. We could
+ * technically delay creation of the second set until first use as an
+ * optimization, but that is not implemented yet.
+ */
+ keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, ForwardScanDirection,
+ keyPointers, skip->fwdNotNullKeys, &stratTotal, skip->prefix);
+ _bt_create_insertion_scan_key(rel, ForwardScanDirection, keyPointers, keysCount,
+ &skip->fwdScanKey, &stratTotal, &goback);
+
+ keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, BackwardScanDirection,
+ keyPointers, skip->bwdNotNullKeys, &stratTotal, skip->prefix);
+ _bt_create_insertion_scan_key(rel, BackwardScanDirection, keyPointers, keysCount,
+ &skip->bwdScanKey, &stratTotal, &goback);
+
+ _bt_metaversion(rel, &skip->curPos.skipScanKey.heapkeyspace,
+ &skip->curPos.skipScanKey.allequalimage);
+ skip->curPos.skipScanKey.anynullkeys = false; /* unused */
+ skip->curPos.skipScanKey.nextkey = false;
+ skip->curPos.skipScanKey.pivotsearch = false;
+ skip->curPos.skipScanKey.scantid = NULL;
+ skip->curPos.skipScanKey.keysz = 0;
+
+ /*
+ * Set up the scankey for the current tuple as well. We won't necessarily
+ * use data from the current tuple right away, but the rest of the data
+ * structure needs to be set up correctly for when we use it to build
+ * skip->curPos.skipScanKey later.
+ */
+ _bt_mkscankey(rel, NULL, &skip->currentTupleKey);
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within a page, specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf)
+{
+ /*
+ * @todo: a further optimization is possible here: only check either the
+ * low or the high key, depending on which direction *we came from* and
+ * which direction *we are planning to scan*.
+ */
+ OffsetNumber low, high;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int ans_lo, ans_hi;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (unlikely(high < low))
+ return false;
+
+ ans_lo = _bt_compare(scan->indexRelation,
+ key, page, low);
+ ans_hi = _bt_compare(scan->indexRelation,
+ key, page, high);
+ if (key->nextkey)
+ {
+ /* sk < last && sk >= first */
+ return ans_lo >= 0 && ans_hi == -1;
+ }
+ else
+ {
+ /* sk <= last && sk > first */
+ return ans_lo == 1 && ans_hi <= 0;
+ }
+}
+
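+/*
+ * Search the tree (or only the current page, if the scankey still falls
+ * within it) for the position indicated by scanKey, returning the tuple
+ * and offset we land on via curTuple/curTupleOffnum.
+ */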
+/* in: pinned and locked, out: pinned and locked (unless end of scan) */
+static void
+_bt_skip_find(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ BTScanInsert scanKey, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ OffsetNumber offnum;
+ BTStack stack;
+ Buffer buf;
+ bool goback;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ Relation rel = scan->indexRelation;
+ bool fromroot = true;
+
+ _bt_set_bsearch_flags(scanKey->scankeys[scanKey->keysz - 1].sk_strategy, dir, &scanKey->nextkey, &goback);
+
+ if ((DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages) && !IsCatalogRelation(rel))
+ {
+ if (*curTuple != NULL)
+ print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+ "before btree search");
+
+ elog(DEBUG1, "%s searching tree with %d keys, nextkey=%d, goback=%d",
+ RelationGetRelationName(rel), scanKey->keysz, scanKey->nextkey,
+ goback);
+
+ _print_skey(scan, scanKey);
+ }
+
+ if (*curTupleOffnum == InvalidOffsetNumber)
+ {
+ BTScanPosUnpinIfPinned(so->currPos);
+ }
+ else
+ {
+ if (_bt_scankey_within_page(scan, scanKey, so->currPos.buf))
+ {
+ elog(DEBUG1, "sk found within current page");
+
+ offnum = _bt_binsrch(scan->indexRelation, scanKey, so->currPos.buf);
+ fromroot = false;
+ }
+ else
+ {
+ _bt_unlockbuf(rel, so->currPos.buf);
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ if (fromroot)
+ {
+ stack = _bt_search(scan->indexRelation, scanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+
+ offnum = _bt_binsrch(scan->indexRelation, scanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+ }
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (goback)
+ {
+ offnum = OffsetNumberPrev(offnum);
+ minoff = P_FIRSTDATAKEY(opaque);
+ if (offnum < minoff)
+ {
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_back_page(scan, curTuple, curTupleOffnum))
+ return;
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ offnum = PageGetMaxOffsetNumber(page);
+ }
+ }
+ else if (offnum > PageGetMaxOffsetNumber(page))
+ {
+ BlockNumber next = opaque->btpo_next;
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_forward_page(scan, next, curTuple, curTupleOffnum))
+ return;
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ offnum = P_FIRSTDATAKEY(opaque);
+ }
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ *curTupleOffnum = offnum;
+ *curTuple = _bt_get_tuple_from_offset(so, offnum);
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+ "after btree search");
+}
+
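+/*
+ * Step one page in the given direction; thin wrapper around
+ * _bt_step_forward_page() and _bt_step_back_page().
+ */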
+static inline bool
+_bt_step_one_page(IndexScanDesc scan, ScanDirection dir, IndexTuple *curTuple,
+ OffsetNumber *curTupleOffnum)
+{
+ if (ScanDirectionIsForward(dir))
+ {
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ return _bt_step_forward_page(scan, so->currPos.nextPage, curTuple, curTupleOffnum);
+ }
+ else
+ {
+ return _bt_step_back_page(scan, curTuple, curTupleOffnum);
+ }
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+ OffsetNumber *curTupleOffnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ BlockNumber blkno = next;
+ Page page;
+ BTPageOpaque opaque;
+
+ Assert(BTScanPosIsValid(so->currPos));
+
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ /*
+ * Before we modify currPos, make a copy of the page data if there was a
+ * mark position that needs it.
+ */
+ if (so->markItemIndex >= 0)
+ {
+ /* bump pin on current buffer for assignment to mark buffer */
+ if (BTScanPosIsPinned(so->currPos))
+ IncrBufferRefCount(so->currPos.buf);
+ memcpy(&so->markPos, &so->currPos,
+ offsetof(BTScanPosData, items[1]) +
+ so->currPos.lastItem * sizeof(BTScanPosItem));
+ if (so->markTuples)
+ memcpy(so->markTuples, so->currTuples,
+ so->currPos.nextTupleOffset);
+ so->markPos.itemIndex = so->markItemIndex;
+ if (so->skipData)
+ memcpy(&so->skipData->markPos, &so->skipData->curPos,
+ sizeof(BTSkipPosData));
+ so->markItemIndex = -1;
+ }
+
+ /* Remember we left a page with data */
+ so->currPos.moreLeft = true;
+
+ /* release the previous buffer, if pinned */
+ BTScanPosUnpinIfPinned(so->currPos);
+
+ {
+ for (;;)
+ {
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE)
+ {
+ _bt_parallel_done(scan);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /* check for interrupts while we're not holding any buffer lock */
+ CHECK_FOR_INTERRUPTS();
+ /* step right one page */
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ page = BufferGetPage(so->currPos.buf);
+ TestForOldSnapshot(scan->xs_snapshot, rel, page);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* check for deleted page */
+ if (!P_IGNORE(opaque))
+ {
+ PredicateLockPage(rel, blkno, scan->xs_snapshot);
+ *curTupleOffnum = P_FIRSTDATAKEY(opaque);
+ *curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+ break;
+ }
+
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, so->currPos.buf);
+ }
+ }
+
+ return true;
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(BTScanPosIsValid(so->currPos));
+
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ /*
+ * Before we modify currPos, make a copy of the page data if there was a
+ * mark position that needs it.
+ */
+ if (so->markItemIndex >= 0)
+ {
+ /* bump pin on current buffer for assignment to mark buffer */
+ if (BTScanPosIsPinned(so->currPos))
+ IncrBufferRefCount(so->currPos.buf);
+ memcpy(&so->markPos, &so->currPos,
+ offsetof(BTScanPosData, items[1]) +
+ so->currPos.lastItem * sizeof(BTScanPosItem));
+ if (so->markTuples)
+ memcpy(so->markTuples, so->currTuples,
+ so->currPos.nextTupleOffset);
+ if (so->skipData)
+ memcpy(&so->skipData->markPos, &so->skipData->curPos,
+ sizeof(BTSkipPosData));
+ so->markPos.itemIndex = so->markItemIndex;
+ so->markItemIndex = -1;
+ }
+
+ /* Remember we left a page with data */
+ so->currPos.moreRight = true;
+
+ /* Not parallel, so just use our own notion of the current page */
+
+ {
+ Relation rel;
+ Page page;
+ BTPageOpaque opaque;
+
+ rel = scan->indexRelation;
+
+ if (BTScanPosIsPinned(so->currPos))
+ _bt_lockbuf(rel, so->currPos.buf, BT_READ);
+ else
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+
+ for (;;)
+ {
+ /* Step to next physical page */
+ so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+ scan->xs_snapshot);
+
+ /* if we're physically at end of index, return failure */
+ if (so->currPos.buf == InvalidBuffer)
+ {
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /*
+ * Okay, we managed to move left to a non-deleted page. Done if
+ * it's not half-dead and contains matching tuples. Else loop back
+ * and do it all again.
+ */
+ page = BufferGetPage(so->currPos.buf);
+ TestForOldSnapshot(scan->xs_snapshot, rel, page);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (!P_IGNORE(opaque))
+ {
+ PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+ *curTupleOffnum = PageGetMaxOffsetNumber(page);
+ *curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+ break;
+ }
+ }
+ }
+
+ return true;
+}
+
+/* holds lock as long as curTupleOffnum != InvalidOffsetNumber */
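+/*
+ * Main driver of the skip scan: alternate between skipping to the next
+ * matching prefix and reading tuples within that prefix, until we either
+ * find a tuple that satisfies all quals or determine that the scan is
+ * done.
+ */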
+bool
+_bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ BTSkipCompareResult cmp;
+
+ while (_bt_skip_is_valid(so, prefixDir, postfixDir))
+ {
+ bool found;
+ _bt_skip_until_match(scan, &curTuple, &curTupleOffnum, prefixDir, postfixDir);
+
+ while (_bt_skip_is_always_valid(so))
+ {
+ OffsetNumber first = curTupleOffnum;
+ found = _bt_readpage(scan, postfixDir, &curTupleOffnum,
+ _bt_skip_is_regular_mode(prefixDir, postfixDir));
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ {
+ print_itup(BufferGetBlockNumber(so->currPos.buf),
+ _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+ "first item on page compared");
+ print_itup(BufferGetBlockNumber(so->currPos.buf),
+ _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+ "last item on page compared");
+ }
+ _bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+ _bt_determine_next_action(scan, &cmp, first, curTupleOffnum,
+ postfixDir, &skip->curPos.nextAction);
+ skip->curPos.nextDirection = prefixDir;
+ skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+
+ if (found)
+ {
+ _bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+ prefixDir, postfixDir);
+ return true;
+ }
+ else if (skip->curPos.nextAction == SkipStateNext)
+ {
+ if (curTupleOffnum != InvalidOffsetNumber)
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (!_bt_step_one_page(scan, postfixDir, &curTuple, &curTupleOffnum))
+ return false;
+ }
+ else if (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra)
+ {
+ curTuple = _bt_get_tuple_from_offset(so, curTupleOffnum);
+ _bt_skip_update_scankey_after_read(scan, curTuple, prefixDir, postfixDir);
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ curTupleOffnum = InvalidOffsetNumber;
+ curTuple = NULL;
+ break;
+ }
+ else if (skip->curPos.nextAction == SkipStateStop)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+ else
+ {
+ Assert(false);
+ }
+ }
+ }
+ return false;
+}
+
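+/*
+ * Keep skipping (prefix skips and extra-qual skips) until the next action
+ * is no longer a skip state, or the scan becomes invalid.
+ */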
+void
+_bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ while (_bt_skip_is_valid(so, prefixDir, postfixDir) &&
+ (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra))
+ {
+ _bt_skip_once(scan, curTuple, curTupleOffnum,
+ skip->curPos.nextAction == SkipStateSkip, prefixDir, postfixDir);
+ }
+}
+
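+/*
+ * Compare the given tuple against the regular quals and the current skip
+ * scan key, filling in a BTSkipCompareResult that drives the decision on
+ * what to do next.
+ */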
+void
+_bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts, ScanDirection dir,
+ bool isRegularMode, BTSkipCompareResult* cmp)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+
+ if (_bt_skip_is_always_valid(so))
+ {
+ bool continuescan = true;
+
+ cmp->equal = _bt_checkkeys(scan, tuple, tupnatts, dir, &continuescan, &cmp->prefixSkipIndex);
+ cmp->fullKeySkip = !continuescan;
+ /*
+ * The prefix can be smaller than the scankey because extra quals may have
+ * been added, so we need to compare both. @todo this can be optimized
+ * into one function call.
+ */
+ cmp->prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+ cmp->skCmpResult = _bt_compare_until(scan->indexRelation,
+ &skip->curPos.skipScanKey, tuple, skip->curPos.skipScanKey.keysz);
+ if (cmp->prefixSkipIndex == -1)
+ {
+ if (isRegularMode)
+ {
+ cmp->prefixSkip = false;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ else
+ {
+ cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ }
+ else
+ {
+ int newskip = -1;
+ _bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &continuescan, &newskip);
+ if (newskip != -1)
+ {
+ cmp->prefixSkip = true;
+ cmp->prefixSkipIndex = newskip;
+ }
+ else
+ {
+ if (isRegularMode)
+ {
+ cmp->prefixSkip = false;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ else
+ {
+ cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ }
+ }
+
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ {
+ print_itup(BufferGetBlockNumber(so->currPos.buf), tuple, NULL, scan->indexRelation,
+ "compare item");
+ _print_skey(scan, &skip->curPos.skipScanKey);
+ elog(DEBUG1, "result: eq: %d fkskip: %d pfxskip: %d prefixcmpres: %d prefixskipidx: %d", cmp->equal, cmp->fullKeySkip,
+ _bt_should_prefix_skip(cmp), cmp->prefixCmpResult, cmp->prefixSkipIndex);
+ }
+ }
+ else
+ {
+ /* We cannot stop the scan if !isRegularMode; in that case we need to skip to the next prefix instead */
+ cmp->fullKeySkip = isRegularMode;
+ cmp->equal = false;
+ cmp->prefixCmpResult = -2;
+ cmp->prefixSkip = true;
+ cmp->prefixSkipIndex = skip->prefix;
+ cmp->skCmpResult = -2;
+ }
+}
+
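+/*
+ * Perform one skip: update the skip scan key based on the current tuple,
+ * search the tree for the next prefix, and decide on the next action.
+ * This may loop if the tuple we land on immediately requires another
+ * prefix skip. Afterwards, the extra quals are applied where applicable.
+ */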
+void
+_bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ BTSkipCompareResult cmp;
+ bool doskip = forceSkip;
+ int skipIndex = skip->curPos.nextSkipIndex;
+ skip->curPos.nextAction = SkipStateSkipExtra;
+
+ while (doskip)
+ {
+ int toskip = skipIndex;
+ if (*curTuple != NULL)
+ {
+ if (skip->prefix <= skipIndex || !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ {
+ toskip = skip->prefix;
+ }
+
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+ toskip, *curTuple, prefixDir);
+ }
+
+ if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+ {
+ debug_print(*curTuple, &so->skipData->curPos.skipScanKey, scan->indexRelation, "skip");
+ }
+
+ _bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, prefixDir);
+
+ if (_bt_skip_is_always_valid(so))
+ {
+ _bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+ prefixDir, prefixDir, true, *curTuple);
+ _bt_compare_current_item(scan, *curTuple,
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ prefixDir,
+ _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+ skipIndex = cmp.prefixSkipIndex;
+ _bt_determine_next_action_after_skip(so, &cmp, prefixDir,
+ postfixDir, toskip, &skip->curPos.nextAction);
+ }
+ else
+ {
+ skip->curPos.nextAction = SkipStateStop;
+ }
+ doskip = skip->curPos.nextAction == SkipStateSkip;
+ }
+ if (skip->curPos.nextAction != SkipStateStop && skip->curPos.nextAction != SkipStateNext)
+ _bt_skip_extra_conditions(scan, curTuple, curTupleOffnum, prefixDir, postfixDir, &cmp);
+}
+
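+/*
+ * After arriving at a new prefix, try to position the scan within that
+ * prefix by searching with the extra quals on the postfix columns.
+ */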
+void
+_bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ bool regularMode = _bt_skip_is_regular_mode(prefixDir, postfixDir);
+ if (_bt_skip_is_always_valid(so))
+ {
+ do
+ {
+ if (*curTuple != NULL)
+ _bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+ postfixDir, prefixDir, false, *curTuple);
+ if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+ {
+ debug_print(*curTuple, &so->skipData->curPos.skipScanKey, scan->indexRelation, "skip-extra");
+ }
+ _bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, postfixDir);
+ _bt_compare_current_item(scan, *curTuple,
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), cmp);
+ } while (regularMode && cmp->prefixCmpResult != 0 && !cmp->equal && !cmp->fullKeySkip);
+ skip->curPos.nextSkipIndex = cmp->prefixSkipIndex;
+ }
+ _bt_determine_next_action_after_skip_extra(so, cmp, &skip->curPos.nextAction);
+}
+
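+/*
+ * After reading a page, refresh the skip scan key to match the action we
+ * decided on (a prefix skip or an extra-qual skip), so that the next call
+ * continues from the correct position.
+ */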
+static void
+_bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+ ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ if (skip->curPos.nextAction == SkipStateSkip)
+ {
+ int toskip = skip->curPos.nextSkipIndex;
+ if (skip->prefix <= skip->curPos.nextSkipIndex ||
+ !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ {
+ toskip = skip->prefix;
+ }
+
+ if (_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+ toskip, curTuple, prefixDir);
+ else
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+ toskip, NULL, prefixDir);
+ }
+ else if (skip->curPos.nextAction == SkipStateSkipExtra)
+ {
+ _bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+ postfixDir, prefixDir, false, curTuple);
+ }
+}
+
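+/*
+ * Compare a single scankey entry against one index attribute value,
+ * applying the usual NULL ordering rules; this mirrors the per-attribute
+ * logic of _bt_compare().
+ */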
+static inline int
+_bt_compare_one(ScanKey scankey, Datum datum2, bool isNull2)
+{
+ int32 result;
+ Datum datum1 = scankey->sk_argument;
+ bool isNull1 = scankey->sk_flags & SK_ISNULL;
+ /* see comments about NULLs handling in btbuild */
+ if (isNull1) /* key is NULL */
+ {
+ if (isNull2)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull2) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum2,
+ datum1));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+ return result;
+}
+
+/*
+ * set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_scankey_with_tuple(BTScanInsert insertKey, Relation indexRel, IndexTuple itup, int numattrs)
+{
+ TupleDesc itupdesc;
+ int i;
+ ScanKey scankeys = insertKey->scankeys;
+
+ insertKey->keysz = numattrs;
+ itupdesc = RelationGetDescr(indexRel);
+ for (i = 0; i < numattrs; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/* copy the fields relevant to skipping from one insertion scan key to another */
+static inline void
+_bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs)
+{
+ memcpy(to->scankeys, from->scankeys, sizeof(ScanKeyData) * (unsigned long)numattrs);
+ to->nextkey = from->nextkey;
+ to->keysz = numattrs;
+}
+
+/*
+ * Updates the existing scankey for skipping to the next prefix.
+ *
+ * The 'prefix' argument determines how many attributes the scankey will
+ * have. Callers normally pass skip->prefix, but a smaller value can be
+ * passed based on the comparison result with the current tuple.
+ * For example, take SELECT * FROM tbl WHERE b<2 with an index on (a,b,c),
+ * skipping with prefix size=2. If we encounter the tuple (1,3,1), it does
+ * not match the qual b<2, but we also know that it is not useful to skip
+ * to any next prefix of size 2 (eg. (1,4)), because that will definitely
+ * not match either. However, we do want to skip to eg. (2,0). Therefore,
+ * we skip with prefix=1 in this case.
+ *
+ * The provided itup may be NULL. This happens when we don't want to use
+ * the current tuple to update the scankey, but instead want to use the
+ * existing curPos.skipScanKey to fill currentTupleKey. This accounts for
+ * some edge cases.
+ */
+static void
+_bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+ int prefix, IndexTuple itup, ScanDirection prefixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ /*
+ * The caller passes either skip->prefix or a smaller value derived from
+ * the comparison result, so we never skip on more attributes than
+ * skip->prefix.
+ */
+ int numattrs = prefix;
+
+ if (itup != NULL)
+ {
+ Size itupsz = IndexTupleSize(itup);
+ memcpy(so->skipData->curPos.skipTuple, itup, itupsz);
+
+ _bt_update_scankey_with_tuple(&skip->currentTupleKey, indexRel, (IndexTuple)so->skipData->curPos.skipTuple, numattrs);
+ _bt_copy_scankey(&skip->curPos.skipScanKey, &skip->currentTupleKey, numattrs);
+ }
+ else
+ {
+ skip->curPos.skipScanKey.keysz = numattrs;
+ _bt_copy_scankey(&skip->currentTupleKey, &skip->curPos.skipScanKey, numattrs);
+ }
+ /*
+ * Update the strategy for the last attribute, as we will use it to
+ * determine the remaining flags (goback) when doing the actual tree
+ * search.
+ */
+ skip->currentTupleKey.scankeys[numattrs - 1].sk_strategy =
+ skip->curPos.skipScanKey.scankeys[numattrs - 1].sk_strategy =
+ ScanDirectionIsForward(prefixDir) ? BTGreaterStrategyNumber : BTLessStrategyNumber;
+}
+
+/*
+ * Update the scankey for skipping on the 'extra' conditions: opportunities
+ * that arise when we have just skipped to a new prefix and can try to skip
+ * within that prefix to the right tuple by using extra quals, when available.
+ *
+ * @todo it should be possible to replace calls to this function and to
+ * _bt_skip_update_scankey_for_prefix_skip with more specific functions
+ * that need to do less copying of data.
+ */
+void
+_bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel, ScanDirection curDir,
+ ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ BTScanInsert toCopy;
+ int i, left, lastNonTuple = skip->prefix;
+
+ /* first make sure that currentTupleKey is correct at all times */
+ _bt_skip_update_scankey_for_prefix_skip(scan, indexRel, skip->prefix, itup, prefixDir);
+ /*
+ * Then do the actual work to set up curPos.skipScanKey. Distinguish
+ * between work that depends on prefixDir (the attributes between
+ * attribute number 1 and 'prefix' inclusive) and work that depends on
+ * curDir (the attributes between attribute number 'prefix' + 1 and
+ * fwdScanKey.keysz inclusive).
+ */
+ if (ScanDirectionIsForward(prefixDir))
+ {
+ /*
+ * If prefixDir is forward, we need to choose between fwdScanKey and
+ * currentTupleKey, taking the most restrictive one. In most cases this
+ * means choosing eg. a>5 over a=2 when scanning forward, unless
+ * prioritizeEqual is set, which is done for certain special cases.
+ */
+ for (i = 0; i < skip->prefix; i++)
+ {
+ ScanKey scankey = &skip->fwdScanKey.scankeys[i];
+ ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+ if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) > 0
+ || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+ {
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+ lastNonTuple = i;
+ }
+ else
+ {
+ if (lastNonTuple < i)
+ break;
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+ }
+ /*
+ * For now, use the equality strategy here. @todo this could be improved
+ * slightly by choosing the strategy from the scankeys, but it doesn't
+ * matter much.
+ */
+ skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+ }
+ }
+ else
+ {
+ /* similar for backward but in opposite direction */
+ for (i = 0; i < skip->prefix; i++)
+ {
+ ScanKey scankey = &skip->bwdScanKey.scankeys[i];
+ ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+ if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) < 0
+ || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+ {
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+ lastNonTuple = i;
+ }
+ else
+ {
+ if (lastNonTuple < i)
+ break;
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+ }
+ skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+ }
+ }
+
+ /*
+ * the remaining keys are the quals after the prefix
+ */
+ if (ScanDirectionIsForward(curDir))
+ toCopy = &skip->fwdScanKey;
+ else
+ toCopy = &skip->bwdScanKey;
+
+ if (lastNonTuple >= skip->prefix - 1)
+ {
+ left = toCopy->keysz - skip->prefix;
+ if (left > 0)
+ {
+ memcpy(skip->curPos.skipScanKey.scankeys + skip->prefix, toCopy->scankeys + i, sizeof(ScanKeyData) * (unsigned long)left);
+ }
+ skip->curPos.skipScanKey.keysz = toCopy->keysz;
+ }
+ else
+ {
+ skip->curPos.skipScanKey.keysz = lastNonTuple + 1;
+ }
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 54c8eb1289..03005f89c9 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -560,7 +560,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = _bt_mkscankey(wstate.index, NULL, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index c72b4566de..40da00f72d 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,10 +49,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
-static void _bt_mark_scankey_required(ScanKey skey);
+static void _bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
+ ScanDirection dir, bool *continuescan, int *prefixskipindex);
static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
@@ -87,9 +87,8 @@ static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
* field themselves.
*/
BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key)
{
- BTScanInsert key;
ScanKey skey;
TupleDesc itupdesc;
int indnkeyatts;
@@ -109,8 +108,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* Truncated attributes and non-key attributes are omitted from the final
* scan key.
*/
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
+ if (key == NULL)
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+
if (itup)
_bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
else
@@ -155,7 +156,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
ScanKeyEntryInitializeWithInfo(&skey[i],
flags,
(AttrNumber) (i + 1),
- InvalidStrategy,
+ BTEqualStrategyNumber,
InvalidOid,
rel->rd_indcollation[i],
procinfo,
@@ -745,7 +746,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
- int numberOfEqualCols;
+ int numberOfEqualCols, numberOfEqualColsSincePrefix;
ScanKey inkeys;
ScanKey outkeys;
ScanKey cur;
@@ -754,6 +755,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int i,
j;
AttrNumber attno;
+ int prefix = 0;
/* initialize result variables */
so->qual_ok = true;
@@ -762,6 +764,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (numberOfKeys < 1)
return; /* done if qual-less scan */
+ if (_bt_skip_enabled(so))
+ {
+ prefix = so->skipData->prefix;
+ }
+
/*
* Read so->arrayKeyData if array keys are present, else scan->keyData
*/
@@ -786,7 +793,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ _bt_mark_scankey_required(outkeys, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (cur->sk_attno <= prefix + 1)
+ _bt_mark_scankey_required(outkeys, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
return;
}
@@ -795,6 +804,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
new_numberOfKeys = 0;
numberOfEqualCols = 0;
+ numberOfEqualColsSincePrefix = 0;
+
/*
* Initialize for processing of keys for attr 1.
@@ -830,6 +841,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (i == numberOfKeys || cur->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
+ int priorNumberOfEqualColsSincePrefix = numberOfEqualColsSincePrefix;
+
/* check input keys are correctly ordered */
if (i < numberOfKeys && cur->sk_attno < attno)
@@ -880,6 +893,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/* track number of attrs for which we have "=" keys */
numberOfEqualCols++;
+ if (attno > prefix)
+ numberOfEqualColsSincePrefix++;
}
/* try to keep only one of <, <= */
@@ -929,7 +944,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
memcpy(outkey, xform[j], sizeof(ScanKeyData));
if (priorNumberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
+ _bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (attno <= prefix || priorNumberOfEqualColsSincePrefix == attno - prefix - 1)
+ _bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
}
}
@@ -954,7 +971,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
memcpy(outkey, cur, sizeof(ScanKeyData));
if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
+ _bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+ _bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
/*
* We don't support RowCompare using equality; such a qual would
@@ -997,7 +1016,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
memcpy(outkey, cur, sizeof(ScanKeyData));
if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
+ _bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+ _bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
}
}
}
@@ -1295,7 +1316,7 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
* anyway on a rescan. Something to keep an eye on though.
*/
static void
-_bt_mark_scankey_required(ScanKey skey)
+_bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag)
{
int addflags;
@@ -1303,14 +1324,14 @@ _bt_mark_scankey_required(ScanKey skey)
{
case BTLessStrategyNumber:
case BTLessEqualStrategyNumber:
- addflags = SK_BT_REQFWD;
+ addflags = forwardReqFlag;
break;
case BTEqualStrategyNumber:
- addflags = SK_BT_REQFWD | SK_BT_REQBKWD;
+ addflags = forwardReqFlag | backwardReqFlag;
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
- addflags = SK_BT_REQBKWD;
+ addflags = backwardReqFlag;
break;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -1353,17 +1374,22 @@ _bt_mark_scankey_required(ScanKey skey)
*/
bool
_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
+ ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
{
TupleDesc tupdesc;
BTScanOpaque so;
int keysz;
int ikey;
ScanKey key;
+ int pfx;
+
+ if (prefixSkipIndex == NULL)
+ prefixSkipIndex = &pfx;
Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
*continuescan = true; /* default assumption */
+ *prefixSkipIndex = -1;
tupdesc = RelationGetDescr(scan->indexRelation);
so = (BTScanOpaque) scan->opaque;
@@ -1392,7 +1418,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
if (key->sk_flags & SK_ROW_HEADER)
{
if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
+ continuescan, prefixSkipIndex))
continue;
return false;
}
@@ -1429,6 +1455,13 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
ScanDirectionIsBackward(dir))
*continuescan = false;
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1452,6 +1485,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
}
else
{
@@ -1468,6 +1505,9 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsForward(dir))
*continuescan = false;
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
}
/*
@@ -1498,6 +1538,13 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
ScanDirectionIsBackward(dir))
*continuescan = false;
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1509,6 +1556,228 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
return true;
}
+bool
+_bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+ int pfx;
+ BTScanInsert keys;
+ bool overallmatch = true;
+
+ if (prefixSkipIndex == NULL)
+ prefixSkipIndex = &pfx;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+ *prefixSkipIndex = -1;
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ if (ScanDirectionIsForward(dir))
+ keys = &so->skipData->bwdScanKey;
+ else
+ keys = &so->skipData->fwdScanKey;
+
+ keysz = keys->keysz;
+
+ for (key = keys->scankeys, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ int cmpresult;
+
+ if (key->sk_attno == 0)
+ continue;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ continue;
+ }
+
+ /* row-comparison keys need special processing */
+ Assert((key->sk_flags & SK_ROW_HEADER) == 0);
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+
+ overallmatch = false;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsForward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ }
+
+ overallmatch = false;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&key->sk_func,
+ key->sk_collation,
+ datum,
+ key->sk_argument));
+ if (key->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ if (cmpresult != 0)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir) && cmpresult > 0)
+ {
+ *continuescan = false;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir) && cmpresult < 0)
+ {
+ *continuescan = false;
+ return false;
+ }
+
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir) && cmpresult > 0)
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir) && cmpresult < 0)
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ overallmatch = false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return overallmatch;
+}
+
/*
* Test whether an indextuple satisfies a row-comparison scan condition.
*
@@ -1520,7 +1789,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
{
ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
int32 cmpresult = 0;
@@ -1576,6 +1845,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
}
else
{
@@ -1592,6 +1865,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsForward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
}
/*
@@ -1616,6 +1893,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
+ else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
return false;
}
@@ -1678,6 +1962,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
+ else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
}
return result;
@@ -2733,3 +3024,524 @@ _bt_allequalimage(Relation rel, bool debugmessage)
return allequalimage;
}
+
+void _bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir, bool* nextkey, bool* goback)
+{
+ /*----------
+ * Examine the selected initial-positioning strategy to determine exactly
+ * where we need to start the scan, and set flag variables to control the
+ * code below.
+ *
+ * If nextkey = false, _bt_search and _bt_binsrch will locate the first
+ * item >= scan key. If nextkey = true, they will locate the first
+ * item > scan key.
+ *
+ * If goback = true, we will then step back one item, while if
+ * goback = false, we will start the scan on the located item.
+ *----------
+ */
+ switch (stratTotal)
+ {
+ case BTLessStrategyNumber:
+
+ /*
+ * Find first item >= scankey, then back up one to arrive at last
+ * item < scankey. (Note: this positioning strategy is only used
+ * for a backward scan, so that is always the correct starting
+ * position.)
+ */
+ *nextkey = false;
+ *goback = true;
+ break;
+
+ case BTLessEqualStrategyNumber:
+
+ /*
+ * Find first item > scankey, then back up one to arrive at last
+ * item <= scankey. (Note: this positioning strategy is only used
+ * for a backward scan, so that is always the correct starting
+ * position.)
+ */
+ *nextkey = true;
+ *goback = true;
+ break;
+
+ case BTEqualStrategyNumber:
+
+ /*
+ * If a backward scan was specified, need to start with last equal
+ * item not first one.
+ */
+ if (ScanDirectionIsBackward(dir))
+ {
+ /*
+ * This is the same as the <= strategy. We will check at the
+ * end whether the found item is actually =.
+ */
+ *nextkey = true;
+ *goback = true;
+ }
+ else
+ {
+ /*
+ * This is the same as the >= strategy. We will check at the
+ * end whether the found item is actually =.
+ */
+ *nextkey = false;
+ *goback = false;
+ }
+ break;
+
+ case BTGreaterEqualStrategyNumber:
+
+ /*
+ * Find first item >= scankey. (This is only used for forward
+ * scans.)
+ */
+ *nextkey = false;
+ *goback = false;
+ break;
+
+ case BTGreaterStrategyNumber:
+
+ /*
+ * Find first item > scankey. (This is only used for forward
+ * scans.)
+ */
+ *nextkey = true;
+ *goback = false;
+ break;
+
+ default:
+ /* can't get here, but keep compiler quiet */
+ elog(ERROR, "unrecognized strat_total: %d", (int) stratTotal);
+ }
+}
+
+bool _bt_create_insertion_scan_key(Relation rel, ScanDirection dir, ScanKey* startKeys, int keysCount, BTScanInsert inskey, StrategyNumber* stratTotal, bool* goback)
+{
+ int i;
+ bool nextkey;
+
+ /*
+ * We want to start the scan somewhere within the index. Set up an
+ * insertion scankey we can use to search for the boundary point we
+ * identified above. The insertion scankey is built using the keys
+ * identified by startKeys[]. (Remaining insertion scankey fields are
+ * initialized after initial-positioning strategy is finalized.)
+ */
+ Assert(keysCount <= INDEX_MAX_KEYS);
+ for (i = 0; i < keysCount; i++)
+ {
+ ScanKey cur = startKeys[i];
+
+ if (cur == NULL)
+ {
+ inskey->scankeys[i].sk_attno = 0;
+ continue;
+ }
+
+ Assert(cur->sk_attno == i + 1);
+
+ if (cur->sk_flags & SK_ROW_HEADER)
+ {
+ /*
+ * Row comparison header: look to the first row member instead.
+ *
+ * The member scankeys are already in insertion format (ie, they
+ * have sk_func = 3-way-comparison function), but we have to watch
+ * out for nulls, which _bt_preprocess_keys didn't check. A null
+ * in the first row member makes the condition unmatchable, just
+ * like qual_ok = false.
+ */
+ ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ return false;
+ }
+ memcpy(inskey->scankeys + i, subkey, sizeof(ScanKeyData));
+
+ /*
+ * If the row comparison is the last positioning key we accepted,
+ * try to add additional keys from the lower-order row members.
+ * (If we accepted independent conditions on additional index
+ * columns, we use those instead --- doesn't seem worth trying to
+ * determine which is more restrictive.) Note that this is OK
+ * even if the row comparison is of ">" or "<" type, because the
+ * condition applied to all but the last row member is effectively
+ * ">=" or "<=", and so the extra keys don't break the positioning
+ * scheme. But, by the same token, if we aren't able to use all
+ * the row members, then the part of the row comparison that we
+ * did use has to be treated as just a ">=" or "<=" condition, and
+ * so we'd better adjust strat_total accordingly.
+ */
+ if (i == keysCount - 1)
+ {
+ bool used_all_subkeys = false;
+
+ Assert(!(subkey->sk_flags & SK_ROW_END));
+ for (;;)
+ {
+ subkey++;
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != keysCount + 1)
+ break; /* out-of-sequence, can't use it */
+ if (subkey->sk_strategy != cur->sk_strategy)
+ break; /* wrong direction, can't use it */
+ if (subkey->sk_flags & SK_ISNULL)
+ break; /* can't use null keys */
+ Assert(keysCount < INDEX_MAX_KEYS);
+ memcpy(inskey->scankeys + keysCount, subkey,
+ sizeof(ScanKeyData));
+ keysCount++;
+ if (subkey->sk_flags & SK_ROW_END)
+ {
+ used_all_subkeys = true;
+ break;
+ }
+ }
+ if (!used_all_subkeys)
+ {
+ switch (*stratTotal)
+ {
+ case BTLessStrategyNumber:
+ *stratTotal = BTLessEqualStrategyNumber;
+ break;
+ case BTGreaterStrategyNumber:
+ *stratTotal = BTGreaterEqualStrategyNumber;
+ break;
+ }
+ }
+ break; /* done with outer loop */
+ }
+ }
+ else
+ {
+ /*
+ * Ordinary comparison key. Transform the search-style scan key
+ * to an insertion scan key by replacing the sk_func with the
+ * appropriate btree comparison function.
+ *
+ * If scankey operator is not a cross-type comparison, we can use
+ * the cached comparison function; otherwise gotta look it up in
+ * the catalogs. (That can't lead to infinite recursion, since no
+ * indexscan initiated by syscache lookup will use cross-data-type
+ * operators.)
+ *
+ * We support the convention that sk_subtype == InvalidOid means
+ * the opclass input type; this is a hack to simplify life for
+ * ScanKeyInit().
+ */
+ if (cur->sk_subtype == rel->rd_opcintype[i] ||
+ cur->sk_subtype == InvalidOid)
+ {
+ FmgrInfo *procinfo;
+
+ procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
+ ScanKeyEntryInitializeWithInfo(inskey->scankeys + i,
+ cur->sk_flags,
+ cur->sk_attno,
+ cur->sk_strategy,
+ cur->sk_subtype,
+ cur->sk_collation,
+ procinfo,
+ cur->sk_argument);
+ }
+ else
+ {
+ RegProcedure cmp_proc;
+
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
+ rel->rd_opcintype[i],
+ cur->sk_subtype,
+ BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
+ BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
+ cur->sk_attno, RelationGetRelationName(rel));
+ ScanKeyEntryInitialize(inskey->scankeys + i,
+ cur->sk_flags,
+ cur->sk_attno,
+ cur->sk_strategy,
+ cur->sk_subtype,
+ cur->sk_collation,
+ cmp_proc,
+ cur->sk_argument);
+ }
+ }
+ }
+
+ _bt_set_bsearch_flags(*stratTotal, dir, &nextkey, goback);
+
+ /* Initialize remaining insertion scan key fields */
+ _bt_metaversion(rel, &inskey->heapkeyspace, &inskey->allequalimage);
+ inskey->anynullkeys = false; /* unused */
+ inskey->nextkey = nextkey;
+ inskey->pivotsearch = false;
+ inskey->scantid = NULL;
+ inskey->keysz = keysCount;
+
+ return true;
+}
+
+/*----------
+ * Examine the scan keys to discover where we need to start the scan.
+ *
+ * We want to identify the keys that can be used as starting boundaries;
+ * these are =, >, or >= keys for a forward scan or =, <, <= keys for
+ * a backwards scan. We can use keys for multiple attributes so long as
+ * the prior attributes had only =, >= (resp. =, <=) keys. Once we accept
+ * a > or < boundary or find an attribute with no boundary (which can be
+ * thought of as the same as "> -infinity"), we can't use keys for any
+ * attributes to its right, because it would break our simplistic notion
+ * of what initial positioning strategy to use.
+ *
+ * When the scan keys include cross-type operators, _bt_preprocess_keys
+ * may not be able to eliminate redundant keys; in such cases we will
+ * arbitrarily pick a usable one for each attribute. This is correct
+ * but possibly not optimal behavior. (For example, with keys like
+ * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
+ * x=5 would be more efficient.) Since the situation only arises given
+ * a poorly-worded query plus an incomplete opfamily, live with it.
+ *
+ * When both equality and inequality keys appear for a single attribute
+ * (again, only possible when cross-type operators appear), we *must*
+ * select one of the equality keys for the starting point, because
+ * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
+ * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
+ * start at x=4, we will fail and stop before reaching x=10. If multiple
+ * equality quals survive preprocessing, however, it doesn't matter which
+ * one we use --- by definition, they are either redundant or
+ * contradictory.
+ *
+ * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
+ * If the index stores nulls at the end of the index we'll be starting
+ * from, and we have no boundary key for the column (which means the key
+ * we deduced NOT NULL from is an inequality key that constrains the other
+ * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
+ * use as a boundary key. If we didn't do this, we might find ourselves
+ * traversing a lot of null entries at the start of the scan.
+ *
+ * In this loop, row-comparison keys are treated the same as keys on their
+ * first (leftmost) columns. We'll add on lower-order columns of the row
+ * comparison below, if possible.
+ *
+ * The selected scan keys (at most one per index column) are remembered by
+ * storing their addresses into the caller-provided startKeys[] array.
+ *----------
+ */
+int _bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir, ScanKey* startKeys, ScanKeyData* notnullkeys, StrategyNumber* stratTotal, int prefix)
+{
+ StrategyNumber strat;
+ int keysCount = 0;
+ int i;
+
+ *stratTotal = BTEqualStrategyNumber;
+ if (numberOfKeys > 0 || prefix > 0)
+ {
+ AttrNumber curattr;
+ ScanKey chosen;
+ ScanKey impliesNN;
+ ScanKey cur;
+
+ /*
+ * chosen is the so-far-chosen key for the current attribute, if any.
+ * We don't cast the decision in stone until we reach keys for the
+ * next attribute.
+ */
+ curattr = 1;
+ chosen = NULL;
+ /* Also remember any scankey that implies a NOT NULL constraint */
+ impliesNN = NULL;
+
+ /*
+ * Loop iterates from 0 to numberOfKeys inclusive; we use the last
+ * pass to handle after-last-key processing. Actual exit from the
+ * loop is at one of the "break" statements below.
+ */
+ for (cur = scanKeys, i = 0;; cur++, i++)
+ {
+ if (i >= numberOfKeys || cur->sk_attno != curattr)
+ {
+ /*
+ * Done looking at keys for curattr. If we didn't find a
+ * usable boundary key, see if we can deduce a NOT NULL key.
+ */
+ if (chosen == NULL && impliesNN != NULL &&
+ ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+ ScanDirectionIsForward(dir) :
+ ScanDirectionIsBackward(dir)))
+ {
+ /* Yes, so build the key in notnullkeys[keysCount] */
+ chosen = &notnullkeys[keysCount];
+ ScanKeyEntryInitialize(chosen,
+ (SK_SEARCHNOTNULL | SK_ISNULL |
+ (impliesNN->sk_flags &
+ (SK_BT_DESC | SK_BT_NULLS_FIRST))),
+ curattr,
+ ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+ BTGreaterStrategyNumber :
+ BTLessStrategyNumber),
+ InvalidOid,
+ InvalidOid,
+ InvalidOid,
+ (Datum) 0);
+ }
+
+ /*
+ * If we still didn't find a usable boundary key, quit; else
+ * save the boundary key pointer in startKeys.
+ */
+ if (chosen == NULL && curattr > prefix)
+ break;
+ startKeys[keysCount++] = chosen;
+
+ /*
+ * Adjust strat_total, and quit if we have stored a > or <
+ * key.
+ */
+ if (chosen != NULL && curattr > prefix)
+ {
+ strat = chosen->sk_strategy;
+ if (strat != BTEqualStrategyNumber)
+ {
+ *stratTotal = strat;
+ if (strat == BTGreaterStrategyNumber ||
+ strat == BTLessStrategyNumber)
+ break;
+ }
+ }
+
+ /*
+ * Done if that was the last attribute, or if next key is not
+ * in sequence (implying no boundary key is available for the
+ * next attribute).
+ */
+ if (i >= numberOfKeys)
+ {
+ curattr++;
+ while(curattr <= prefix)
+ {
+ startKeys[keysCount++] = NULL;
+ curattr++;
+ }
+ break;
+ }
+ else if (cur->sk_attno != curattr + 1)
+ {
+ curattr++;
+ while(curattr < cur->sk_attno && curattr <= prefix)
+ {
+ startKeys[keysCount++] = NULL;
+ curattr++;
+ }
+ if (curattr > prefix && curattr != cur->sk_attno)
+ break;
+ }
+ else
+ {
+ curattr++;
+ }
+
+ /*
+ * Reset for next attr.
+ */
+ chosen = NULL;
+ impliesNN = NULL;
+ }
+
+ /*
+ * Can we use this key as a starting boundary for this attr?
+ *
+ * If not, does it imply a NOT NULL constraint? (Because
+ * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
+ * *any* inequality key works for that; we need not test.)
+ */
+ switch (cur->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (chosen == NULL)
+ {
+ if (ScanDirectionIsBackward(dir))
+ chosen = cur;
+ else
+ impliesNN = cur;
+ }
+ break;
+ case BTEqualStrategyNumber:
+ /* override any non-equality choice */
+ chosen = cur;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ if (chosen == NULL)
+ {
+ if (ScanDirectionIsForward(dir))
+ chosen = cur;
+ else
+ impliesNN = cur;
+ }
+ break;
+ }
+ }
+ }
+ return keysCount;
+}
+
+void print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel, char *extra)
+{
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+ char *lkey_desc = NULL;
+ char *rkey_desc;
+
+ /* Avoid infinite recursion -- don't instrument catalog indexes */
+ if (!IsCatalogRelation(rel))
+ {
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(left, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(left, itupdesc, values, isnull);
+ rel->rd_index->indnkeyatts = natts;
+
+ /*
+ * Since the regression tests should pass when the instrumentation
+ * patch is applied, be prepared for BuildIndexValueDescription() to
+ * return NULL due to security considerations.
+ */
+ lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ if (lkey_desc && right)
+ {
+ /*
+ * Revolting hack: modify tuple descriptor to have number of key
+ * columns actually present in caller's pivot tuples
+ */
+ natts = BTreeTupleGetNAtts(right, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(right, itupdesc, values, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ rkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ elog(DEBUG1, "%s blk %u sk > %s, sk <= %s %s",
+ RelationGetRelationName(rel), blk, lkey_desc, rkey_desc,
+ extra);
+ pfree(rkey_desc);
+ }
+ else
+ elog(DEBUG1, "%s blk %u sk check %s %s",
+ RelationGetRelationName(rel), blk, lkey_desc, extra);
+
+ /* Cleanup */
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
+ if (lkey_desc)
+ pfree(lkey_desc);
+ }
+}
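A quick worked example of the startkey selection above, traced by hand
(so take it as my reading of the code, not something the patch output):
with an index on (a, b, c) and a forward scan whose quals are

WHERE a = 1 AND b >= 2 AND c < 10

the loop picks a = 1 and b >= 2 as startKeys[] entries (= and >= are
usable forward boundaries), while c < 10 constrains the wrong end of a
forward scan and so cannot become a starting boundary -- the loop stops
there, with stratTotal left at >=. The skip scan twist is the prefix
argument: prefix columns that have no quals at all get a NULL
placeholder in startKeys[] rather than terminating the loop, so the
caller can still build an insertion scan key that covers the whole skip
prefix.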
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 03a9cd36e6..fde43f8131 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -71,6 +71,9 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 10644dfac4..ef8e89d259 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -152,6 +152,7 @@ static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainIndentText(ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1114,6 +1115,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1461,6 +1478,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ if (indexscan->indexdistinct)
+ ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1471,6 +1491,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->indexdistinct)
+ ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1731,6 +1754,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyText("Skip scan", ((IndexScan *) plan)->indexdistinct ? "Distinct only" : "All", es);
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1744,6 +1769,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyText("Skip scan", ((IndexOnlyScan *) plan)->indexdistinct ? "Distinct only" : "All", es);
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->indexqual)
@@ -1760,6 +1787,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate->instrument->ntuples2, 0, es);
break;
case T_BitmapIndexScan:
+ if (((BitmapIndexScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyText("Skip scan", "All", es);
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 9f1d8b6d1e..3c1a79a809 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -133,6 +133,14 @@ ExecScanFetch(ScanState *node,
return (*accessMtd) (node);
}
+TupleTableSlot *
+ExecScan(ScanState *node,
+ ExecScanAccessMtd accessMtd, /* function returning a tuple */
+ ExecScanRecheckMtd recheckMtd)
+{
+ return ExecScanExtended(node, accessMtd, recheckMtd, NULL);
+}
+
/* ----------------------------------------------------------------
* ExecScan
*
@@ -155,9 +163,10 @@ ExecScanFetch(ScanState *node,
* ----------------------------------------------------------------
*/
TupleTableSlot *
-ExecScan(ScanState *node,
+ExecScanExtended(ScanState *node,
ExecScanAccessMtd accessMtd, /* function returning a tuple */
- ExecScanRecheckMtd recheckMtd)
+ ExecScanRecheckMtd recheckMtd,
+ ExecScanSkipMtd skipMtd)
{
ExprContext *econtext;
ExprState *qual;
@@ -170,6 +179,20 @@ ExecScan(ScanState *node,
projInfo = node->ps.ps_ProjInfo;
econtext = node->ps.ps_ExprContext;
+ if (skipMtd != NULL && node->ss_FirstTupleEmitted)
+ {
+ bool cont = skipMtd(node);
+ if (!cont)
+ {
+ node->ss_FirstTupleEmitted = false;
+ return ExecClearTuple(node->ss_ScanTupleSlot);
+ }
+ }
+ else
+ {
+ node->ss_FirstTupleEmitted = true;
+ }
+
/* interrupt checks are in ExecScanFetch */
/*
@@ -178,8 +201,13 @@ ExecScan(ScanState *node,
*/
if (!qual && !projInfo)
{
+ TupleTableSlot *slot;
+
ResetExprContext(econtext);
- return ExecScanFetch(node, accessMtd, recheckMtd);
+ slot = ExecScanFetch(node, accessMtd, recheckMtd);
+ if (TupIsNull(slot))
+ node->ss_FirstTupleEmitted = false;
+ return slot;
}
/*
@@ -206,6 +234,7 @@ ExecScan(ScanState *node,
*/
if (TupIsNull(slot))
{
+ node->ss_FirstTupleEmitted = false;
if (projInfo)
return ExecClearTuple(projInfo->pi_state.resultslot);
else
@@ -306,6 +335,8 @@ ExecScanReScan(ScanState *node)
*/
ExecClearTuple(node->ss_ScanTupleSlot);
+ node->ss_FirstTupleEmitted = false;
+
/* Rescan EvalPlanQual tuple if we're inside an EvalPlanQual recheck */
if (estate->es_epq_active != NULL)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 48c2036297..2b05970ec3 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -22,13 +22,14 @@
#include "postgres.h"
#include "access/genam.h"
+#include "access/relscan.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeIndexscan.h"
#include "miscadmin.h"
+#include "utils/rel.h"
#include "utils/memutils.h"
-
/* ----------------------------------------------------------------
* ExecBitmapIndexScan
*
@@ -223,6 +224,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecBitmapIndexScan;
+ indexstate->ss.ss_FirstTupleEmitted = false;
/* normally we don't make the result bitmap till runtime */
indexstate->biss_result = NULL;
@@ -308,10 +310,20 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/*
* Initialize scan descriptor.
*/
- indexstate->biss_ScanDesc =
- index_beginscan_bitmap(indexstate->biss_RelationDesc,
- estate->es_snapshot,
- indexstate->biss_NumScanKeys);
+ if (node->indexskipprefixsize > 0)
+ {
+ indexstate->biss_ScanDesc =
+ index_beginscan_bitmap_skip(indexstate->biss_RelationDesc,
+ estate->es_snapshot,
+ indexstate->biss_NumScanKeys,
+ Min(IndexRelationGetNumberOfKeyAttributes(indexstate->biss_RelationDesc),
+ node->indexskipprefixsize));
+ }
+ else
+ indexstate->biss_ScanDesc =
+ index_beginscan_bitmap(indexstate->biss_RelationDesc,
+ estate->es_snapshot,
+ indexstate->biss_NumScanKeys);
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0754e28a9a..7b13ec8a87 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/predicate.h"
+#include "storage/itemptr.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -49,6 +50,37 @@ static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
TupleDesc itupdesc);
+static bool
+IndexOnlySkip(IndexOnlyScanState *node)
+{
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ if (!node->ioss_Distinct)
+ return true;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ /* flip direction if this is an overall backward scan */
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->ioss_ScanDesc;
+
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir))
+ return false;
+
+ return true;
+}
/* ----------------------------------------------------------------
* IndexOnlyNext
@@ -65,6 +97,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ ItemPointerData startTid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -72,7 +106,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -90,11 +124,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
* serially executing an index only scan that was planned to be
* parallel.
*/
- scandesc = index_beginscan(node->ss.ss_currentRelation,
- node->ioss_RelationDesc,
- estate->es_snapshot,
- node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ if (node->ioss_SkipPrefixSize > 0)
+ scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+ node->ioss_RelationDesc,
+ estate->es_snapshot,
+ node->ioss_NumScanKeys,
+ node->ioss_NumOrderByKeys,
+ Min(IndexRelationGetNumberOfKeyAttributes(node->ioss_RelationDesc), node->ioss_SkipPrefixSize));
+ else
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->ioss_RelationDesc,
+ estate->es_snapshot,
+ node->ioss_NumScanKeys,
+ node->ioss_NumOrderByKeys);
node->ioss_ScanDesc = scandesc;
@@ -114,11 +156,16 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_OrderByKeys,
node->ioss_NumOrderByKeys);
}
+ else
+ {
+ ItemPointerCopy(&scandesc->xs_heaptid, &startTid);
+ }
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while ((tid = node->ioss_SkipPrefixSize > 0 ? index_getnext_tid_skip(scandesc, direction, node->ioss_Distinct ? indexonlyscan->indexorderdir : direction) :
+ index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
@@ -314,9 +361,10 @@ ExecIndexOnlyScan(PlanState *pstate)
if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
ExecReScan((PlanState *) node);
- return ExecScan(&node->ss,
+ return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) IndexOnlyNext,
- (ExecScanRecheckMtd) IndexOnlyRecheck);
+ (ExecScanRecheckMtd) IndexOnlyRecheck,
+ node->ioss_Distinct ? (ExecScanSkipMtd) IndexOnlySkip : NULL);
}
/* ----------------------------------------------------------------
@@ -504,6 +552,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ss.ss_FirstTupleEmitted = false;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_Distinct = node->indexdistinct;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2fffb1b437..c6b6e7a6fb 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -69,6 +69,37 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
Datum *orderbyvals, bool *orderbynulls);
static HeapTuple reorderqueue_pop(IndexScanState *node);
+static bool
+IndexSkip(IndexScanState *node)
+{
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ if (!node->iss_Distinct)
+ return true;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ /* flip direction if this is an overall backward scan */
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->iss_ScanDesc;
+
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir))
+ return false;
+
+ return true;
+}
/* ----------------------------------------------------------------
* IndexNext
@@ -85,6 +116,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -92,7 +124,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -109,14 +141,25 @@ IndexNext(IndexScanState *node)
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
*/
- scandesc = index_beginscan(node->ss.ss_currentRelation,
- node->iss_RelationDesc,
- estate->es_snapshot,
- node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ if (node->iss_SkipPrefixSize > 0)
+ scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ Min(IndexRelationGetNumberOfKeyAttributes(node->iss_RelationDesc), node->iss_SkipPrefixSize));
+ else
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ /* Index skip scan assumes xs_want_itup, so set it to true if we skip over distinct */
+ node->iss_ScanDesc->xs_want_itup = indexscan->indexdistinct;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -130,7 +173,9 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (node->iss_SkipPrefixSize > 0 ?
+ index_getnext_slot_skip(scandesc, direction, node->iss_Distinct ? indexscan->indexorderdir : direction, slot) :
+ index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
@@ -530,13 +575,15 @@ ExecIndexScan(PlanState *pstate)
ExecReScan((PlanState *) node);
if (node->iss_NumOrderByKeys > 0)
- return ExecScan(&node->ss,
+ return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) IndexNextWithReorder,
- (ExecScanRecheckMtd) IndexRecheck);
+ (ExecScanRecheckMtd) IndexRecheck,
+ node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
else
- return ExecScan(&node->ss,
+ return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) IndexNext,
- (ExecScanRecheckMtd) IndexRecheck);
+ (ExecScanRecheckMtd) IndexRecheck,
+ node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
}
/* ----------------------------------------------------------------
@@ -910,6 +957,9 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ss.ss_FirstTupleEmitted = false;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_Distinct = node->indexdistinct;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 652ba7f8ee..268e799e82 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -497,6 +497,8 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
+ COPY_SCALAR_FIELD(indexdistinct);
return newnode;
}
@@ -522,6 +524,8 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
+ COPY_SCALAR_FIELD(indexdistinct);
return newnode;
}
@@ -546,6 +550,7 @@ _copyBitmapIndexScan(const BitmapIndexScan *from)
COPY_SCALAR_FIELD(isshared);
COPY_NODE_FIELD(indexqual);
COPY_NODE_FIELD(indexqualorig);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5a25a50edc..be6b359aa3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -569,6 +569,8 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
+ WRITE_INT_FIELD(indexdistinct);
}
static void
@@ -583,6 +585,9 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
+ WRITE_INT_FIELD(indexdistinct);
+
}
static void
@@ -596,6 +601,7 @@ _outBitmapIndexScan(StringInfo str, const BitmapIndexScan *node)
WRITE_BOOL_FIELD(isshared);
WRITE_NODE_FIELD(indexqual);
WRITE_NODE_FIELD(indexqualorig);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 54d97ac3d0..b95f217475 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1878,6 +1878,8 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
+ READ_INT_FIELD(indexdistinct);
READ_DONE();
}
@@ -1897,6 +1899,8 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
+ READ_INT_FIELD(indexdistinct);
READ_DONE();
}
@@ -1915,6 +1919,7 @@ _readBitmapIndexScan(void)
READ_BOOL_FIELD(isshared);
READ_NODE_FIELD(indexqual);
READ_NODE_FIELD(indexqualorig);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 4dfc5d29bc..9cac47a9ca 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1755,7 +1755,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
List *pathkeys = (List *) lfirst(lcp);
List *startup_subpaths = NIL;
List *total_subpaths = NIL;
+ List *uniq_total_subpaths = NIL;
bool startup_neq_total = false;
+ bool uniq_neq_total = false;
ListCell *lcr;
bool match_partition_order;
bool match_partition_order_desc;
@@ -1784,7 +1786,8 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
{
RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
Path *cheapest_startup,
- *cheapest_total;
+ *cheapest_total,
+ *cheapest_uniq_total = NULL;
/* Locate the right paths, if they are available. */
cheapest_startup =
@@ -1800,6 +1803,19 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
TOTAL_COST,
false);
+ cheapest_uniq_total =
+ get_cheapest_path_for_pathkeys(childrel->unique_pathlist,
+ pathkeys,
+ NULL,
+ TOTAL_COST,
+ false);
+
+ if (cheapest_uniq_total != NULL && !uniq_neq_total)
+ {
+ uniq_neq_total = true;
+ uniq_total_subpaths = list_copy(total_subpaths);
+ }
+
/*
* If we can't find any paths with the right order just use the
* cheapest-total path; we'll have to sort it later.
@@ -1812,6 +1828,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
Assert(cheapest_total->param_info == NULL);
}
+ if (cheapest_uniq_total == NULL)
+ cheapest_uniq_total = cheapest_total;
+
/*
* Notice whether we actually have different paths for the
* "cheapest" and "total" cases; frequently there will be no point
@@ -1838,6 +1857,12 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
startup_subpaths = lappend(startup_subpaths, cheapest_startup);
total_subpaths = lappend(total_subpaths, cheapest_total);
+
+ if (uniq_neq_total)
+ {
+ cheapest_uniq_total = get_singleton_append_subpath(cheapest_uniq_total);
+ uniq_total_subpaths = lappend(uniq_total_subpaths, cheapest_uniq_total);
+ }
}
else if (match_partition_order_desc)
{
@@ -1851,6 +1876,12 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
startup_subpaths = lcons(cheapest_startup, startup_subpaths);
total_subpaths = lcons(cheapest_total, total_subpaths);
+
+ if (uniq_neq_total)
+ {
+ cheapest_uniq_total = get_singleton_append_subpath(cheapest_uniq_total);
+ uniq_total_subpaths = lcons(cheapest_uniq_total, uniq_total_subpaths);
+ }
}
else
{
@@ -1862,6 +1893,11 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
&startup_subpaths, NULL);
accumulate_append_subpath(cheapest_total,
&total_subpaths, NULL);
+ if (uniq_neq_total)
+ {
+ accumulate_append_subpath(cheapest_uniq_total,
+ &uniq_total_subpaths, NULL);
+ }
}
}
@@ -1888,6 +1924,16 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
0,
false,
-1));
+ if (uniq_neq_total)
+ add_unique_path(rel, (Path *) create_append_path(root,
+ rel,
+ uniq_total_subpaths,
+ NIL,
+ pathkeys,
+ NULL,
+ 0,
+ false,
+ -1));
}
else
{
@@ -1903,6 +1949,12 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
total_subpaths,
pathkeys,
NULL));
+ if (uniq_neq_total)
+ add_unique_path(rel, (Path *) create_merge_append_path(root,
+ rel,
+ uniq_total_subpaths,
+ pathkeys,
+ NULL));
}
}
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 1e4d404f02..ff75c02003 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -133,6 +133,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 0e4e00eaf0..ba2dd30a13 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -784,6 +784,16 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
IndexPath *ipath = (IndexPath *) lfirst(lc);
+ /*
+ * To prevent unique paths from index skip scans from being used when they
+ * are not needed, keep them in a separate pathlist.
+ */
+ if (ipath->indexdistinct)
+ {
+ add_unique_path(rel, (Path *) ipath);
+ continue;
+ }
+
if (index->amhasgettuple)
add_path(rel, (Path *) ipath);
@@ -872,6 +882,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
+ bool can_skip;
int indexcol;
/*
@@ -1021,6 +1032,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan = (scantype != ST_BITMAPSCAN &&
check_index_only(rel, index));
+ /* Check if an index skip scan is possible. */
+ can_skip = enable_indexskipscan && index->amcanskip;
+
/*
* 4. Generate an indexscan path if there are relevant restriction clauses
* in the current clauses, OR the index ordering is potentially useful for
@@ -1044,6 +1058,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
false);
result = lappend(result, ipath);
+ /* Consider index skip scan as well */
+ if (root->query_uniquekeys != NULL && can_skip)
+ {
+ int numusefulkeys = list_length(useful_pathkeys);
+ int numsortkeys = list_length(root->query_pathkeys);
+
+ if (numusefulkeys == numsortkeys)
+ {
+ int prefix;
+ if (list_length(root->distinct_pathkeys) > 0)
+ prefix = find_index_prefix_for_pathkey(root,
+ index,
+ ForwardScanDirection,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
+ else
+ /* All distinct keys are constant and were optimized away;
+ * skipping with a prefix of 1 is sufficient since they are all constant anyway.
+ */
+ prefix = 1;
+
+ result = lappend(result,
+ create_skipscan_unique_path(root, index,
+ (Path *) ipath, prefix));
+ }
+ }
+
/*
* If appropriate, consider parallel index scan. We don't allow
* parallel index scan for bitmap index scans.
@@ -1099,6 +1140,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
false);
result = lappend(result, ipath);
+ /* Consider index skip scan as well */
+ if (root->query_uniquekeys != NULL && can_skip)
+ {
+ int numusefulkeys = list_length(useful_pathkeys);
+ int numsortkeys = list_length(root->query_pathkeys);
+
+ if (numusefulkeys == numsortkeys)
+ {
+ int prefix;
+ if (list_length(root->distinct_pathkeys) > 0)
+ prefix = find_index_prefix_for_pathkey(root,
+ index,
+ BackwardScanDirection,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
+ else
+ /* All distinct keys are constant and were optimized away;
+ * skipping with a prefix of 1 is sufficient since they are all constant anyway.
+ */
+ prefix = 1;
+
+ result = lappend(result,
+ create_skipscan_unique_path(root, index,
+ (Path *) ipath, prefix));
+ }
+ }
+
/* If appropriate, consider parallel index scan */
if (index->amcanparallel &&
rel->consider_parallel && outer_relids == NULL &&
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index ad4fe19872..519bbbb788 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -522,6 +522,78 @@ get_cheapest_parallel_safe_total_inner(List *paths)
* NEW PATHKEY FORMATION
****************************************************************************/
+/*
+ * Find the prefix size for a specific path key in an index.
+ * For example, an index with (a,b,c) finding path key b will
+ * return prefix 2.
+ * Returns 0 when not found.
+ */
+int
+find_index_prefix_for_pathkey(PlannerInfo *root,
+ IndexOptInfo *index,
+ ScanDirection scandir,
+ PathKey *pathkey)
+{
+ ListCell *lc;
+ int i;
+
+ i = 0;
+ foreach(lc, index->indextlist)
+ {
+ TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+ Expr *indexkey;
+ bool reverse_sort;
+ bool nulls_first;
+ PathKey *cpathkey;
+
+ /*
+ * INCLUDE columns are stored in index unordered, so they don't
+ * support ordered index scan.
+ */
+ if (i >= index->nkeycolumns)
+ break;
+
+ /* We assume we don't need to make a copy of the tlist item */
+ indexkey = indextle->expr;
+
+ if (ScanDirectionIsBackward(scandir))
+ {
+ reverse_sort = !index->reverse_sort[i];
+ nulls_first = !index->nulls_first[i];
+ }
+ else
+ {
+ reverse_sort = index->reverse_sort[i];
+ nulls_first = index->nulls_first[i];
+ }
+
+ /*
+ * OK, try to make a canonical pathkey for this sort key. Note we're
+ * underneath any outer joins, so nullable_relids should be NULL.
+ */
+ cpathkey = make_pathkey_from_sortinfo(root,
+ indexkey,
+ NULL,
+ index->sortopfamily[i],
+ index->opcintype[i],
+ index->indexcollations[i],
+ reverse_sort,
+ nulls_first,
+ 0,
+ index->rel->relids,
+ false);
+
+ if (cpathkey == pathkey)
+ {
+ return i + 1;
+ }
+
+ i++;
+ }
+
+ return 0;
+}
+
/*
* build_index_pathkeys
* Build a pathkeys list that describes the ordering induced by an index
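To tie find_index_prefix_for_pathkey back to the build_index_paths
changes: with an index on (a, b, c) and a query such as

SELECT DISTINCT a, b FROM t;

the last distinct pathkey is b, so the function returns a prefix of 2
and the skip scan jumps between distinct (a, b) pairs. When every
distinct key is a constant (and so optimized away) the code falls back
to a prefix of 1, per the comment in indxpath.c. At least that's my
understanding of the intent.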
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 3dc0176a51..673e122714 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -185,15 +185,20 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix,
+ bool distinct);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix,
+ bool distinct);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
- List *indexqualorig);
+ List *indexqualorig,
+ int skipPrefixSize);
static BitmapHeapScan *make_bitmap_heapscan(List *qptlist,
List *qpqual,
Plan *lefttree,
@@ -3066,7 +3071,9 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexquals,
fixed_indexorderbys,
best_path->indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix,
+ best_path->indexdistinct);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -3077,7 +3084,9 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix,
+ best_path->indexdistinct);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -3367,7 +3376,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
plan = (Plan *) make_bitmap_indexscan(iscan->scan.scanrelid,
iscan->indexid,
iscan->indexqual,
- iscan->indexqualorig);
+ iscan->indexqualorig,
+ iscan->indexskipprefixsize);
/* and set its cost/width fields appropriately */
plan->startup_cost = 0.0;
plan->total_cost = ipath->indextotalcost;
@@ -5410,7 +5420,9 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize,
+ bool distinct)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5427,6 +5439,8 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
+ node->indexdistinct = distinct;
return node;
}
@@ -5439,7 +5453,9 @@ make_indexonlyscan(List *qptlist,
List *indexqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize,
+ bool distinct)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5454,6 +5470,8 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
+ node->indexdistinct = distinct;
return node;
}
@@ -5462,7 +5480,8 @@ static BitmapIndexScan *
make_bitmap_indexscan(Index scanrelid,
Oid indexid,
List *indexqual,
- List *indexqualorig)
+ List *indexqualorig,
+ int skipPrefixSize)
{
BitmapIndexScan *node = makeNode(BitmapIndexScan);
Plan *plan = &node->scan.plan;
@@ -5475,6 +5494,7 @@ make_bitmap_indexscan(Index scanrelid,
node->indexid = indexid;
node->indexqual = indexqual;
node->indexqualorig = indexqualorig;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ea2408c13f..91b1ca2634 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3109,12 +3109,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
root->distinct_pathkeys =
make_pathkeys_for_sortclauses(root,
parse->distinctClause,
tlist);
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->query_uniquekeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4441,7 +4447,7 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
RelOptInfo *distinct_rel)
{
Query *parse = root->parse;
- Path *cheapest_input_path = input_rel->cheapest_total_path;
+ Path *cheapest_input_path = input_rel->cheapest_distinct_unique_path;
double numDistinctRows;
bool allow_hash;
Path *path;
@@ -4514,8 +4520,14 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = (Path *) lfirst(lc);
- if (query_has_uniquekeys_for(root, needed_pathkeys, false))
+ if (query_has_uniquekeys_for(root, path->uniquekeys, false))
add_path(distinct_rel, path);
+ else if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
+ add_path(distinct_rel, (Path *)
+ create_upper_unique_path(root, distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
}
/* For explicit-sort case, always use the more rigorous clause */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 74e100e5a9..16633fd672 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -245,6 +245,7 @@ set_cheapest(RelOptInfo *parent_rel)
{
Path *cheapest_startup_path;
Path *cheapest_total_path;
+ Path *cheapest_distinct_unique_path;
Path *best_param_path;
List *parameterized_paths;
ListCell *p;
@@ -256,6 +257,7 @@ set_cheapest(RelOptInfo *parent_rel)
cheapest_startup_path = cheapest_total_path = best_param_path = NULL;
parameterized_paths = NIL;
+ cheapest_distinct_unique_path = NULL;
foreach(p, parent_rel->pathlist)
{
@@ -354,6 +356,36 @@ set_cheapest(RelOptInfo *parent_rel)
cheapest_total_path = best_param_path;
Assert(cheapest_total_path != NULL);
+ cheapest_distinct_unique_path = cheapest_total_path;
+
+ foreach(p, parent_rel->unique_pathlist)
+ {
+ Path *path = (Path *) lfirst(p);
+ int cmp;
+
+ /* Unparameterized path, so consider it for cheapest slots */
+ if (cheapest_distinct_unique_path == NULL)
+ {
+ cheapest_distinct_unique_path = path;
+ continue;
+ }
+
+ /*
+ * If we find two paths of identical costs, try to keep the
+ * better-sorted one. The paths might have unrelated sort
+ * orderings, in which case we can only guess which might be
+ * better to keep, but if one is superior then we definitely
+ * should keep that one.
+ */
+ cmp = compare_path_costs(cheapest_distinct_unique_path, path, TOTAL_COST);
+ if (cmp > 0 ||
+ (cmp == 0 &&
+ compare_pathkeys(cheapest_distinct_unique_path->pathkeys,
+ path->pathkeys) == PATHKEYS_BETTER2))
+ cheapest_distinct_unique_path = path;
+ }
+
+ parent_rel->cheapest_distinct_unique_path = cheapest_distinct_unique_path;
parent_rel->cheapest_startup_path = cheapest_startup_path;
parent_rel->cheapest_total_path = cheapest_total_path;
parent_rel->cheapest_unique_path = NULL; /* computed only if needed */
@@ -1293,6 +1325,10 @@ create_append_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = pathkeys;
+ if (list_length(subpaths) == 1)
+ {
+ pathnode->path.uniquekeys = ((Path*)linitial(subpaths))->uniquekeys;
+ }
/*
* For parallel append, non-partial paths are sorted by descending total
@@ -1437,6 +1473,10 @@ create_merge_append_path(PlannerInfo *root,
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
pathnode->subpaths = subpaths;
+ if (list_length(subpaths) == 1)
+ {
+ pathnode->path.uniquekeys = ((Path*)linitial(subpaths))->uniquekeys;
+ }
/*
* Apply query-wide LIMIT if known and path is for sole base relation.
@@ -3094,6 +3134,44 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root, IndexOptInfo *index,
+ Path *basepath, int prefix)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+ int numDistinctRows;
+ UniqueKey *ukey;
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ ukey = linitial_node(UniqueKey, root->query_uniquekeys);
+
+ Assert(prefix > 0);
+ pathnode->indexskipprefix = prefix;
+ pathnode->indexdistinct = true;
+ pathnode->path.uniquekeys = root->query_uniquekeys;
+
+ numDistinctRows = estimate_num_groups(root, ukey->exprs,
+ pathnode->path.rows,
+ NULL, NULL);
+
+ pathnode->path.total_cost = pathnode->path.startup_cost * numDistinctRows;
+ pathnode->path.rows = numDistinctRows;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
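The costing in create_skipscan_unique_path is very simple: it charges
the underlying index path's startup cost once per estimated distinct
prefix group. With made-up numbers, if the plain index path has
startup_cost 0.29 and estimate_num_groups() predicts 100 distinct
prefixes, then the skip path ends up with

total_cost = 0.29 * 100 = 29, rows = 100

regardless of how many index tuples actually sit under each prefix.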
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a50f897ffa..e67026ac3d 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,9 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL &&
+ amroutine->amgetskiptuple != NULL &&
+ amroutine->ambeginskipscan != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index d2ce4a8450..d393dbe2aa 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1000,6 +1000,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3fe9a53cb3..df48ee95c4 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -367,6 +367,7 @@
#enable_incremental_sort = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_memoize = on
#enable_mergejoin = on
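Like the other planner method GUCs, this makes it easy to take skip
scans out of the picture when testing, e.g.:

SET enable_indexskipscan = off;

which should leave the planner with only the preexisting DISTINCT
strategies.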
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9e93908c65..eb17cbbc9b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1009,7 +1009,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1104,7 +1104,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->indexRel = indexRel;
state->enforceUnique = enforceUnique;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d357ebb559..5f436476ef 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -162,6 +162,12 @@ typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
int nkeys,
int norderbys);
+/* prepare for index scan with skip */
+typedef IndexScanDesc (*ambeginscan_skip_function) (Relation indexRelation,
+ int nkeys,
+ int norderbys,
+ int prefix);
+
/* (re)start index scan */
typedef void (*amrescan_function) (IndexScanDesc scan,
ScanKey keys,
@@ -173,6 +179,16 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* next valid tuple */
+typedef bool (*amgettuple_with_skip_function) (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+
+/* skip past duplicates */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -269,12 +285,15 @@ typedef struct IndexAmRoutine
amvalidate_function amvalidate;
amadjustmembers_function amadjustmembers; /* can be NULL */
ambeginscan_function ambeginscan;
+ ambeginscan_skip_function ambeginskipscan;
amrescan_function amrescan;
amgettuple_function amgettuple; /* can be NULL */
+ amgettuple_with_skip_function amgetskiptuple; /* can be NULL */
amgetbitmap_function amgetbitmap; /* can be NULL */
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 480a4762f5..38f51b4690 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,9 +152,17 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_skip(Relation heapRelation,
+ Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys, int norderbys, int prefix);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
+extern IndexScanDesc index_beginscan_bitmap_skip(Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys,
+ int prefix);
extern void index_rescan(IndexScanDesc scan,
ScanKey keys, int nkeys,
ScanKey orderbys, int norderbys);
@@ -170,10 +178,16 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
ParallelIndexScanDesc pscan);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
+extern ItemPointer index_getnext_tid_skip(IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
struct TupleTableSlot;
extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
struct TupleTableSlot *slot);
+extern bool index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir,
+ ScanDirection postfixDir,
+ struct TupleTableSlot *slot);
extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -183,6 +197,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *istat);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection prefixDir,
+ ScanDirection postfixDir);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 30a216e4c0..de19bbc878 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1027,6 +1027,54 @@ typedef struct BTArrayKeyInfo
Datum *elem_values; /* array of num_elems Datums */
} BTArrayKeyInfo;
+typedef struct BTSkipCompareResult
+{
+ bool equal;
+ int prefixCmpResult, skCmpResult;
+ bool prefixSkip, fullKeySkip;
+ int prefixSkipIndex;
+} BTSkipCompareResult;
+
+typedef enum BTSkipState
+{
+ SkipStateStop,
+ SkipStateSkip,
+ SkipStateSkipExtra,
+ SkipStateNext
+} BTSkipState;
+
+typedef struct BTSkipPosData
+{
+ BTSkipState nextAction;
+ ScanDirection nextDirection;
+ int nextSkipIndex;
+ BTScanInsertData skipScanKey;
+ char skipTuple[BLCKSZ]; /* tuple data that the skipScanKey Datums point to */
+} BTSkipPosData;
+
+typedef struct BTSkipData
+{
+ /* used to control skipping
+ * curPos.skipScanKey is a combination of currentTupleKey and fwdScanKey/bwdScanKey.
+ * currentTupleKey contains the scan keys for the current tuple
+ * fwdScanKey contains the scan keys for quals that would be chosen for a forward scan
+ * bwdScanKey contains the scan keys for quals that would be chosen for a backward scan
+ * we need both fwd and bwd, because the scan keys differ for going fwd and bwd
+ * if a qual would be a>2 and a<5, fwd would have a>2, while bwd would have a<5
+ */
+ BTScanInsertData currentTupleKey;
+ BTScanInsertData fwdScanKey;
+ ScanKeyData fwdNotNullKeys[INDEX_MAX_KEYS];
+ BTScanInsertData bwdScanKey;
+ ScanKeyData bwdNotNullKeys[INDEX_MAX_KEYS];
+ /* length of prefix to skip */
+ int prefix;
+
+ BTSkipPosData curPos, markPos;
+} BTSkipData;
+
+typedef BTSkipData *BTSkip;
+
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
@@ -1064,6 +1112,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTSkip skipData; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -1078,6 +1129,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_REQSKIPFWD 0x00040000 /* required to continue forward scan within current prefix */
+#define SK_BT_REQSKIPBKWD 0x00080000 /* required to continue backward scan within current prefix */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1124,9 +1177,12 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
bool indexUnchanged,
struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
@@ -1227,15 +1283,81 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
+extern Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber *offnum, bool isRegularMode);
+extern bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+extern void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+
+/*
+* prototypes for functions in nbtskip.c
+*/
+static inline bool
+_bt_skip_enabled(BTScanOpaque so)
+{
+ return so->skipData != NULL;
+}
+
+static inline bool
+_bt_skip_is_regular_mode(ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ return prefixDir == postfixDir;
+}
+
+/* returns whether or not we can use extra quals in the scankey after skipping to a prefix */
+static inline bool
+_bt_has_extra_quals_after_skip(BTSkip skip, ScanDirection dir, int prefix)
+{
+ if (ScanDirectionIsForward(dir))
+ {
+ return skip->fwdScanKey.keysz > prefix;
+ }
+ else
+ {
+ return skip->bwdScanKey.keysz > prefix;
+ }
+}
+
+/* alias of BTScanPosIsValid */
+static inline bool
+_bt_skip_is_always_valid(BTScanOpaque so)
+{
+ return BTScanPosIsValid(so->currPos);
+}
+
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_create_scankeys(Relation rel, BTScanOpaque so);
+extern void _bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel,
+ ScanDirection curDir, ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup);
+extern void _bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp);
+extern bool _bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_has_results(BTScanOpaque so);
+extern void _bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool isRegularMode, BTSkipCompareResult* cmp);
+extern bool _bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum);
+extern bool _bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+ OffsetNumber *curTupleOffnum);
+extern bool _bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixskipindex);
+extern IndexTuple
+_bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1244,7 +1366,7 @@ extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
+ int tupnatts, ScanDirection dir, bool *continuescan, int *indexSkipPrefix);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1266,6 +1388,19 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
extern bool _bt_allequalimage(Relation rel, bool debugmessage);
+extern bool _bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixSkipIndex);
+extern bool _bt_create_insertion_scan_key(Relation rel, ScanDirection dir,
+ ScanKey* startKeys, int keysCount,
+ BTScanInsert inskey, StrategyNumber* stratTotal,
+ bool* goback);
+extern void _bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir,
+ bool* nextkey, bool* goback);
+extern int _bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir,
+ScanKey* startKeys, ScanKeyData* notnullkeys,
+ StrategyNumber* stratTotal, int prefix);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key);
+extern void print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel, char *extra);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cd57a704ad..1833fa946b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -455,9 +455,13 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanSkipMtd) (ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+extern TupleTableSlot *ExecScanExtended(ScanState *node, ExecScanAccessMtd accessMtd,
+ ExecScanRecheckMtd recheckMtd,
+ ExecScanSkipMtd skipMtd);
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 2e8cbee69f..4a7c750d18 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1379,6 +1379,7 @@ typedef struct ScanState
Relation ss_currentRelation;
struct TableScanDescData *ss_currentScanDesc;
TupleTableSlot *ss_ScanTupleSlot;
+ bool ss_FirstTupleEmitted;
} ScanState;
/* ----------------
@@ -1475,6 +1476,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_Distinct;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1504,6 +1507,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1522,6 +1527,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_Distinct;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3ae6b91576..87042223c9 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -699,6 +699,7 @@ typedef struct RelOptInfo
List *unique_pathlist; /* unique Paths */
struct Path *cheapest_startup_path;
struct Path *cheapest_total_path;
+ struct Path *cheapest_distinct_unique_path;
struct Path *cheapest_unique_path;
List *cheapest_parameterized_paths;
@@ -1267,6 +1268,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1279,6 +1283,8 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
+ bool indexdistinct;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 01a246d50e..79b70b9414 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -411,6 +411,8 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for skip scans */
+ bool indexdistinct; /* whether only distinct keys are requested */
} IndexScan;
/* ----------------
@@ -438,6 +440,8 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for skip scans */
+ bool indexdistinct; /* whether only distinct keys are requested */
} IndexOnlyScan;
/* ----------------
@@ -464,6 +468,7 @@ typedef struct BitmapIndexScan
bool isshared; /* Create shared bitmap if set */
List *indexqual; /* list of index quals (OpExprs) */
List *indexqualorig; /* the same in original form */
+ int indexskipprefixsize; /* the size of the prefix for skip scans */
} BitmapIndexScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 2113bc82de..561ab023e2 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index facb2dfe74..0343b2e1f6 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,6 +217,10 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ Path *subpath,
+ int prefix);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index e71e65264a..32f0527019 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -212,6 +212,10 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
Relids required_outer,
double fraction);
extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
+extern int find_index_prefix_for_pathkey(PlannerInfo *root,
+ IndexOptInfo *index,
+ ScanDirection scandir,
+ PathKey *pathkey);
extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
ScanDirection scandir);
extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
diff --git a/src/interfaces/libpq/encnames.c b/src/interfaces/libpq/encnames.c
new file mode 120000
index 0000000000..ca78618b55
--- /dev/null
+++ b/src/interfaces/libpq/encnames.c
@@ -0,0 +1 @@
+../../../src/backend/utils/mb/encnames.c
\ No newline at end of file
diff --git a/src/interfaces/libpq/wchar.c b/src/interfaces/libpq/wchar.c
new file mode 120000
index 0000000000..a27508f72a
--- /dev/null
+++ b/src/interfaces/libpq/wchar.c
@@ -0,0 +1 @@
+../../../src/backend/utils/mb/wchar.c
\ No newline at end of file
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 58122c6f88..ec98dbf63b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -375,3 +375,602 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+ QUERY PLAN
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+(2 rows)
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (b = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (b = 2)
+(3 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: Distinct only
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: Distinct only
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (a = 1)
+(3 rows)
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (b = 2)
+ Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column?
+---+----------
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan: Distinct only
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
+-- test tuple killing
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(4 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- regular ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a, b;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+---+---+----
+ 1 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 5 | 1 | 1 | 10
+(4 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+---+---+----
+ 5 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 1 | 1 | 1 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- partial delete
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(5 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(5 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 6e54f3e15e..282cae21b3 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -103,6 +103,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_incremental_sort | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_memoize | on
enable_mergejoin | on
@@ -115,7 +116,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(20 rows)
+(21 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index 1bfe59c26f..708aa2a746 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -174,3 +174,251 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
+
+-- test tuple killing
+
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- regular ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a, b;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- partial delete
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
--
2.29.2
Attachment: v2-0003-Support-skip-scan-for-non-distinct-scans.patch
From e3bc921ad94c82c677886c2baef9a4876f6faeea Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Thu, 19 Mar 2020 10:27:47 +0100
Subject: [PATCH 5/5] Support skip scan for non-distinct scans
Adds planner support to choose a skip scan for regular
non-distinct queries like:
SELECT * FROM t1 WHERE b=1 (with index on (a,b))
---
src/backend/optimizer/path/indxpath.c | 181 +++++++++++++++++++++++++-
src/backend/optimizer/plan/planner.c | 2 +-
src/backend/optimizer/util/pathnode.c | 4 +-
src/backend/utils/adt/selfuncs.c | 153 ++++++++++++++++++++--
src/include/optimizer/pathnode.h | 3 +-
5 files changed, 327 insertions(+), 16 deletions(-)
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index ba2dd30a13..d4d3e1c7eb 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -192,6 +192,17 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static List* add_possible_index_skip_paths(List* result,
+ PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *indexorderbys,
+ List *indexorderbycols,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count);
/*
@@ -820,6 +831,136 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
}
+/*
+ * Find available index skip paths and add them to the path list
+ */
+static List* add_possible_index_skip_paths(List* result,
+ PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *indexorderbys,
+ List *indexorderbycols,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count)
+{
+ int indexcol;
+ bool eqQualHere;
+ bool eqQualPrev;
+ bool eqSoFar;
+ ListCell *lc;
+
+ /*
+ * We need to find possible prefixes to use for the skip scan.
+ * Any useful prefix ends just before an index clause, unless
+ * all clauses so far have been equality clauses.
+ * For example, on an index (a,b,c), the qual b=1 would
+ * mean that an interesting skip prefix could be 1.
+ * For qual a=1 AND b=1, it is not interesting to skip with
+ * prefix 1, because the value of a is fixed already.
+ */
+ indexcol = 0;
+ eqQualHere = false;
+ eqQualPrev = false;
+ eqSoFar = true;
+ foreach(lc, indexclauses)
+ {
+ IndexClause *iclause = lfirst_node(IndexClause, lc);
+ ListCell *lc2;
+
+ if (indexcol != iclause->indexcol)
+ {
+ if (!eqQualHere || indexcol != iclause->indexcol - 1)
+ eqSoFar = false;
+
+ /* Beginning of a new column's quals */
+ if (!eqQualPrev && !eqSoFar)
+ {
+ /* We have a qual on the current column,
+ * there is no equality qual on the previous column,
+ * and not all of the preceding quals are equalities
+ * (the last condition is a special case for the first column in the index).
+ * These are good conditions to try an index skip path.
+ */
+ IndexPath *ipath = create_index_path(root, index,
+ indexclauses,
+ indexorderbys,
+ indexorderbycols,
+ pathkeys,
+ indexscandir,
+ indexonly,
+ required_outer,
+ loop_count,
+ false,
+ iclause->indexcol);
+ result = lappend(result, ipath);
+ }
+
+ eqQualPrev = eqQualHere;
+ eqQualHere = false;
+ indexcol++;
+ /* if the clause is not for this index col, increment until it is */
+ while (indexcol != iclause->indexcol)
+ {
+ eqQualPrev = false;
+ eqSoFar = false;
+ indexcol++;
+ }
+ }
+
+ /* Examine each indexqual associated with this index clause */
+ foreach(lc2, iclause->indexquals)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc2);
+ Expr *clause = rinfo->clause;
+ Oid clause_op = InvalidOid;
+ int op_strategy;
+
+ if (IsA(clause, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) clause;
+ clause_op = op->opno;
+ }
+ else if (IsA(clause, RowCompareExpr))
+ {
+ RowCompareExpr *rc = (RowCompareExpr *) clause;
+ clause_op = linitial_oid(rc->opnos);
+ }
+ else if (IsA(clause, ScalarArrayOpExpr))
+ {
+ ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
+ clause_op = saop->opno;
+ }
+ else if (IsA(clause, NullTest))
+ {
+ NullTest *nt = (NullTest *) clause;
+
+ if (nt->nulltesttype == IS_NULL)
+ {
+ /* IS NULL is like = for selectivity purposes */
+ eqQualHere = true;
+ }
+ }
+ else
+ elog(ERROR, "unsupported indexqual type: %d",
+ (int) nodeTag(clause));
+
+ /* check for equality operator */
+ if (OidIsValid(clause_op))
+ {
+ op_strategy = get_op_opfamily_strategy(clause_op,
+ index->opfamily[indexcol]);
+ Assert(op_strategy != 0); /* not a member of opfamily?? */
+ if (op_strategy == BTEqualStrategyNumber)
+ eqQualHere = true;
+ }
+ }
+ }
+ return result;
+}
+
/*
* build_index_paths
* Given an index and a set of index clauses for it, construct zero
@@ -1055,9 +1196,25 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- false);
+ false,
+ 0);
result = lappend(result, ipath);
+ if (can_skip)
+ {
+ result = add_possible_index_skip_paths(result, root, index,
+ index_clauses,
+ orderbyclauses,
+ orderbyclausecols,
+ useful_pathkeys,
+ index_is_ordered ?
+ ForwardScanDirection :
+ NoMovementScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count);
+ }
+
/* Consider index skip scan as well */
if (root->query_uniquekeys != NULL && can_skip)
{
@@ -1104,7 +1261,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- true);
+ true,
+ 0);
/*
* if, after costing the path, we find that it's not worth using
@@ -1137,9 +1295,23 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- false);
+ false,
+ 0);
result = lappend(result, ipath);
+ if (can_skip)
+ {
+ result = add_possible_index_skip_paths(result, root, index,
+ index_clauses,
+ NIL,
+ NIL,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count);
+ }
+
/* Consider index skip scan as well */
if (root->query_uniquekeys != NULL && can_skip)
{
@@ -1181,7 +1353,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- true);
+ true,
+ 0);
/*
* if, after costing the path, we find that it's not worth
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 91b1ca2634..affeee390c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6013,7 +6013,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
indexScanPath = create_index_path(root, indexInfo,
NIL, NIL, NIL, NIL,
ForwardScanDirection, false,
- NULL, 1.0, false);
+ NULL, 1.0, false, 0);
return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 16633fd672..2b1c64fb51 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1051,7 +1051,8 @@ create_index_path(PlannerInfo *root,
bool indexonly,
Relids required_outer,
double loop_count,
- bool partial_path)
+ bool partial_path,
+ int skip_prefix)
{
IndexPath *pathnode = makeNode(IndexPath);
RelOptInfo *rel = index->rel;
@@ -1071,6 +1072,7 @@ create_index_path(PlannerInfo *root,
pathnode->indexorderbys = indexorderbys;
pathnode->indexorderbycols = indexorderbycols;
pathnode->indexscandir = indexscandir;
+ pathnode->indexskipprefix = skip_prefix;
cost_index(pathnode, root, loop_count, partial_path);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 10895fb287..45f4fc814b 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -210,7 +210,9 @@ static bool get_actual_variable_endpoint(Relation heapRel,
MemoryContext outercontext,
Datum *endpointDatum);
static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
-
+static double estimate_num_groups_internal(PlannerInfo *root, List *groupExprs,
+ double input_rows, double rel_input_rows,
+ List **pgset, EstimationInfo *estinfo);
/*
* eqsel - Selectivity of "=" for any data types.
@@ -3367,6 +3369,19 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
double
estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
List **pgset, EstimationInfo *estinfo)
+{
+ return estimate_num_groups_internal(root, groupExprs, input_rows, -1, pgset, estinfo);
+}
+
+/*
+ * Same as estimate_num_groups, but with an extra argument to control
+ * the estimation used for the input rows of the relation. If
+ * rel_input_rows < 0, it uses the the original planner estimation for the
+ * individual rels, else if uses the estimation as provided to the function.
+ */
+static double
+estimate_num_groups_internal(PlannerInfo *root, List *groupExprs, double input_rows, double rel_input_rows,
+ List **pgset, EstimationInfo *estinfo)
{
List *varinfos = NIL;
double srf_multiplier = 1.0;
@@ -3533,6 +3548,12 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
int relvarcount = 0;
List *newvarinfos = NIL;
List *relvarinfos = NIL;
+ double this_rel_input_rows;
+
+ if (rel_input_rows < 0.0)
+ this_rel_input_rows = rel->rows;
+ else
+ this_rel_input_rows = rel_input_rows;
/*
* Split the list of varinfos in two - one for the current rel, one
@@ -3638,7 +3659,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
* guarding against division by zero when reldistinct is zero.
* Also skip this if we know that we are returning all rows.
*/
- if (reldistinct > 0 && rel->rows < rel->tuples)
+ if (reldistinct > 0 && this_rel_input_rows < rel->tuples)
{
/*
* Given a table containing N rows with n distinct values in a
@@ -3675,7 +3696,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
* works well even when n is small.
*/
reldistinct *=
- (1 - pow((rel->tuples - rel->rows) / rel->tuples,
+ (1 - pow((rel->tuples - this_rel_input_rows) / rel->tuples,
rel->tuples / reldistinct));
}
reldistinct = clamp_row_est(reldistinct);
@@ -6621,8 +6642,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
double numIndexTuples;
Cost descentCost;
List *indexBoundQuals;
+ List *prefixBoundQuals;
int indexcol;
bool eqQualHere;
+ bool stillEq;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6646,9 +6669,11 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
+ prefixBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
+ stillEq = true;
found_is_null_op = false;
num_sa_scans = 1;
foreach(lc, path->indexclauses)
@@ -6660,11 +6685,18 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ stillEq = false;
+ /* done if no '=' qual for indexcol and we're past the skip prefix */
+ if (path->indexskipprefix <= indexcol)
+ break;
+ }
eqQualHere = false;
indexcol++;
+ while (indexcol != iclause->indexcol && path->indexskipprefix > indexcol)
+ indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ break; /* no quals at all for indexcol */
}
/* Examine each indexqual associated with this index clause */
@@ -6696,7 +6728,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
clause_op = saop->opno;
found_saop = true;
/* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
+ if (alength > 1 && stillEq)
num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
@@ -6724,7 +6756,14 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
eqQualHere = true;
}
- indexBoundQuals = lappend(indexBoundQuals, rinfo);
+ /* We keep two lists here: one with all quals up to the prefix,
+ * and one with only the quals up to the first inequality.
+ * The prefix list is needed later.
+ */
+ if (stillEq)
+ indexBoundQuals = lappend(indexBoundQuals, rinfo);
+ if (path->indexskipprefix > 0)
+ prefixBoundQuals = lappend(prefixBoundQuals, rinfo);
}
}
@@ -6750,7 +6789,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* index-bound quals to produce a more accurate idea of the number of
* rows covered by the bound conditions.
*/
- selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+ if (path->indexskipprefix > 0)
+ selectivityQuals = add_predicate_to_index_quals(index, prefixBoundQuals);
+ else
+ selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
index->rel->relid,
@@ -6760,7 +6802,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
/*
* As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
+ * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
* to integer.
*/
numIndexTuples = rint(numIndexTuples / num_sa_scans);
@@ -6806,6 +6848,99 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
+ /*
+ * Add extra costs for using an index skip scan.
+ * The index skip scan could have significantly lower cost until now,
+ * due to the different row estimation used (all the quals up to prefix,
+ * rather than all the quals up to the first non-equality operator).
+ * However, there are extra costs incurred for
+ * a) setting up the scan
+ * b) doing additional scans from root
+ * c) small extra cost per tuple comparison
+ * We add those here
+ */
+ if (path->indexskipprefix > 0)
+ {
+ List *exprlist = NULL;
+ double numgroups_estimate;
+ int i = 0;
+ ListCell *indexpr_item = list_head(path->indexinfo->indexprs);
+ List *selectivityQuals;
+ Selectivity btreeSelectivity;
+ double estimatedIndexTuplesNoPrefix;
+
+ /* some rather arbitrary extra cost for preprocessing structures needed for skip scan */
+ costs.indexStartupCost += 200.0 * cpu_operator_cost;
+ costs.indexTotalCost += 200.0 * cpu_operator_cost;
+
+ /*
+ * In order to reliably get a cost estimation for the number of scans we have to do from root,
+ * we need some estimation on the number of distinct prefixes that exist. Therefore, we need
+ * a different selectivity approximation (this time we do need to use the clauses until the first
+ * non-equality operator). Using that, we can estimate the number of groups
+ */
+ for (i = 0; i < path->indexinfo->nkeycolumns && i < path->indexskipprefix; i++)
+ {
+ Expr *expr = NULL;
+ int attr = path->indexinfo->indexkeys[i];
+ if(attr > 0)
+ {
+ TargetEntry *tentry = get_tle_by_resno(path->indexinfo->indextlist, i + 1);
+ Assert(tentry != NULL);
+ expr = tentry->expr;
+ }
+ else if (attr == 0)
+ {
+ /* Expression index */
+ expr = lfirst(indexpr_item);
+ indexpr_item = lnext(path->indexinfo->indexprs, indexpr_item);
+ }
+ else /* attr < 0 */
+ {
+ /* Index on system column is not supported */
+ Assert(false);
+ }
+
+ exprlist = lappend(exprlist, expr);
+ }
+
+ selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+
+ btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
+ index->rel->relid,
+ JOIN_INNER,
+ NULL);
+ estimatedIndexTuplesNoPrefix = btreeSelectivity * index->rel->tuples;
+
+ /*
+ * As in genericcostestimate(), we have to adjust for any
+ * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
+ * to integer.
+ */
+ estimatedIndexTuplesNoPrefix = rint(estimatedIndexTuplesNoPrefix / num_sa_scans);
+
+ numgroups_estimate = estimate_num_groups_internal(
+ root, exprlist, estimatedIndexTuplesNoPrefix,
+ estimatedIndexTuplesNoPrefix, NULL, NULL);
+
+ /*
+ * For each distinct prefix value we add a descent cost.
+ * This is similar to the startup cost calculation for regular scans.
+ * We can do at most 2 scans from root per distinct prefix, so multiply by 2.
+ * Also add some CPU processing cost per page that we need to process, plus
+ * some additional one-time cost for scanning the leaf page. This is a more
+ * expensive estimation than the per-page cpu cost for the regular index scan.
+ * This is intentional, because the index skip scan does more processing on
+ * the leaf page.
+ */
+ if (index->tuples > 0)
+ descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost * 2;
+ else
+ descentCost = 0;
+ descentCost += (index->tree_height + 1) * 50.0 * cpu_operator_cost * 2 + 200 * cpu_operator_cost;
+ costs.indexTotalCost += costs.num_sa_scans * descentCost * numgroups_estimate;
+ }
+
/*
* If we can get an estimate of the first column's ordering correlation C
* from pg_statistic, estimate the index correlation as C for a
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 0343b2e1f6..da4c933166 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,7 +49,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
bool indexonly,
Relids required_outer,
double loop_count,
- bool partial_path);
+ bool partial_path,
+ int skip_prefix);
extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
RelOptInfo *rel,
Path *bitmapqual,
--
2.29.2
On Thu, Oct 21, 2021 at 07:16:00PM -0700, Peter Geoghegan wrote:
My general concern is that the skip scan patch may currently be
structured in a way that paints us into a corner, MDAM-wise.
Note that the MDAM paper treats skipping a prefix of columns as a case
where the prefix is handled by pretending that there is a clause that
looks like this: "WHERE date between -inf AND +inf" -- which is not so
different from the original sales SQL query example that I have
highlighted. We don't tend to think of queries like this (like my
sales query) as in any way related to skip-scan, because we don't
imagine that there is any skipping going on. But maybe we should
recognize the similarities.
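To make that concrete, here is a small hypothetical example (the table and
index names here are invented for illustration, they are not taken from the
patch). Given an index on (a, b) and a query that only constrains b:

CREATE TABLE t (a int, b int);
CREATE INDEX t_a_b_idx ON t (a, b);

SELECT * FROM t WHERE b = 1;

the MDAM-style reading is that the scan behaves as if the query had also
written an unbounded range qual on the leading column -- conceptually (not
real SQL):

SELECT * FROM t WHERE a BETWEEN -inf AND +inf AND b = 1;

The scan then enumerates the distinct values of a, probing the index once per
value and applying b = 1 within each probe. Seen that way, the skipped prefix
is just one more range predicate, which is where the similarity comes from.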
To avoid potential problems with extensibility in this sense, the
implementation needs to explicitly work with sets of disjoint intervals
of values instead of simple prefix size, one set of intervals per scan
key. An interesting idea; it doesn't seem to be a big change in terms of
the patch itself.
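As a rough sketch of what "one set of intervals per scan key" could mean
(again with invented column names), take an index on (a, b, c) and a query
such as:

SELECT * FROM t WHERE a IN (1, 3) AND b BETWEEN 10 AND 20;

The scan keys would then carry:

  a: [1,1], [3,3]
  b: [10,20]
  c: (-inf,+inf)

A bare "skip the first N columns" prefix can only describe the last case, an
unconstrained key, whereas keeping a set of disjoint intervals for every key
covers the skip-scan case and the more general MDAM cases with one
representation.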
Hi Peter,
On 10/21/21 22:16, Peter Geoghegan wrote:
I returned to the 1995 paper "Efficient Search of Multidimensional
B-Trees" [1] as part of the process of reviewing v39 of the skip scan
patch, which was posted back in May. It's a great paper, and anybody
involved in the skip scan effort should read it thoroughly (if they
haven't already). It's easy to see why people get excited about skip
scan [2]. But there is a bigger picture here.
Thanks for starting this thread !
The Index Skip Scan patch could affect a lot of areas, so keeping MDAM
in mind is definitely important.
However, I think the key part to progress on the "core" functionality
(B-tree related changes) is to get the planner functionality in place
first. Hopefully we can make progress on that during the November
CommitFest based on Andy's patch.
Best regards,
Jesper
Hi,
On Sat, Oct 23, 2021 at 07:30:47PM +0000, Floris Van Nee wrote:
From the patch series above, v9-0001/v9-0002 is the UniqueKeys patch series,
which focuses on the planner. v2-0001 is Dmitry's patch that extends it to
make it possible to use UniqueKeys for the skip scan. Both of these are
unfortunately still older patches, but because they are planner machinery
they are less relevant to the discussion about the executor here. Patch
v2-0002 contains most of my work and introduces all the executor logic for
the skip scan and hooks up the planner for DISTINCT queries to use the skip
scan. Patch v2-0003 is a planner hack that makes the planner pick a skip
scan on virtually every possibility. This also enables the skip scan on the
queries above that don't have a DISTINCT but where it could be useful.
The patchset doesn't apply anymore:
http://cfbot.cputube.org/patch_36_1741.log
=== Applying patches on top of PostgreSQL commit ID a18b6d2dc288dfa6e7905ede1d4462edd6a8af47 ===
=== applying patch ./v2-0001-Extend-UniqueKeys.patch
[...]
patching file src/include/optimizer/paths.h
Hunk #2 FAILED at 299.
1 out of 2 hunks FAILED -- saving rejects to file src/include/optimizer/paths.h.rej
Could you send a rebased version? In the meantime I will change the status on
the cf app to Waiting on Author.
Could you send a rebased version? In the meantime I will change the status
on the cf app to Waiting on Author.
Attached a rebased version.
Attachments:
Attachment: v9-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
From fc2850a972be2ef771a2c02a3531d5fae9e38716 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Sun, 3 May 2020 22:37:46 +0800
Subject: [PATCH 1/5] Introduce RelOptInfo->notnullattrs attribute
The notnullattrs attribute is calculated from the catalog and the run-time
query. That information is translated to child relations as well for
partitioned tables.
---
src/backend/optimizer/path/allpaths.c | 31 ++++++++++++++++++++++++++
src/backend/optimizer/plan/initsplan.c | 10 +++++++++
src/backend/optimizer/util/plancat.c | 10 +++++++++
src/include/nodes/pathnodes.h | 2 ++
4 files changed, 53 insertions(+)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 169b1d53fc..c851995618 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -999,6 +999,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *childrel;
ListCell *parentvars;
ListCell *childvars;
+ int i = -1;
/* append_rel_list contains all append rels; ignore others */
if (appinfo->parent_relid != parentRTindex)
@@ -1055,6 +1056,36 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
(Node *) rel->reltarget->exprs,
1, &appinfo);
+ /* Copy notnullattrs. */
+ while ((i = bms_next_member(rel->notnullattrs, i)) > 0)
+ {
+ AttrNumber attno = i + FirstLowInvalidHeapAttributeNumber;
+ AttrNumber child_attno;
+ if (attno == 0)
+ {
+ /* Whole row is not null, so must be same for child */
+ childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+ attno - FirstLowInvalidHeapAttributeNumber);
+ break;
+ }
+ if (attno < 0 )
+ /* no need to translate system column */
+ child_attno = attno;
+ else
+ {
+ Node * node = list_nth(appinfo->translated_vars, attno - 1);
+ if (!IsA(node, Var))
+ /* This may happen in the UNION case, like (SELECT a FROM t1 UNION SELECT a + 3
+ * FROM t2) t, where we know t.a is not null
+ */
+ continue;
+ child_attno = castNode(Var, node)->varattno;
+ }
+
+ childrel->notnullattrs = bms_add_member(childrel->notnullattrs,
+ child_attno - FirstLowInvalidHeapAttributeNumber);
+ }
+
/*
* We have to make child entries in the EquivalenceClass data
* structures as well. This is needed either if the parent
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index 023efbaf09..fb580df61e 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -831,6 +831,16 @@ deconstruct_recurse(PlannerInfo *root, Node *jtnode, bool below_outer_join,
{
Node *qual = (Node *) lfirst(l);
+ /* Set the not null info now */
+ ListCell *lc;
+ List *non_nullable_vars = find_nonnullable_vars(qual);
+ foreach(lc, non_nullable_vars)
+ {
+ Var *var = lfirst_node(Var, lc);
+ RelOptInfo *rel = root->simple_rel_array[var->varno];
+ rel->notnullattrs = bms_add_member(rel->notnullattrs,
+ var->varattno - FirstLowInvalidHeapAttributeNumber);
+ }
distribute_qual_to_rels(root, qual,
below_outer_join, JOIN_INNER,
root->qual_security_level,
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 535fa041ad..0fdb8e9ada 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -117,6 +117,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
Relation relation;
bool hasindex;
List *indexinfos = NIL;
+ int i;
/*
* We need not lock the relation since it was already locked, either by
@@ -471,6 +472,15 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
if (inhparent && relation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
set_relation_partition_info(root, rel, relation);
+ Assert(rel->notnullattrs == NULL);
+ for(i = 0; i < relation->rd_att->natts; i++)
+ {
+ FormData_pg_attribute attr = relation->rd_att->attrs[i];
+ if (attr.attnotnull)
+ rel->notnullattrs = bms_add_member(rel->notnullattrs,
+ attr.attnum - FirstLowInvalidHeapAttributeNumber);
+ }
+
table_close(relation, NoLock);
/*
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 1f33fe13c1..58b9ef71a7 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -727,6 +727,8 @@ typedef struct RelOptInfo
int rel_parallel_workers; /* wanted number of parallel workers */
uint32 amflags; /* Bitmask of optional features supported by
* the table AM */
+ /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */
+ Bitmapset *notnullattrs;
/* Information about foreign tables and foreign joins */
Oid serverid; /* identifies server for the table or join */
--
2.33.1
Attachment: v9-0002-Introduce-UniqueKey-attributes-on-RelOptInfo-stru.patch
From 7b2041fb894bddef250ab7fb97733d881f2e0904 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E4=B8=80=E6=8C=83?= <yizhi.fzh@alibaba-inc.com>
Date: Mon, 11 May 2020 15:50:52 +0800
Subject: [PATCH 2/5] Introduce UniqueKey attributes on RelOptInfo struct.
UniqueKey is a set of exprs on RelOptInfo which represents that the exprs
will be unique on the given RelOptInfo. See README.uniquekey
for more information.
---
src/backend/nodes/copyfuncs.c | 13 +
src/backend/nodes/list.c | 31 +
src/backend/nodes/makefuncs.c | 13 +
src/backend/nodes/outfuncs.c | 11 +
src/backend/nodes/readfuncs.c | 10 +
src/backend/optimizer/path/Makefile | 3 +-
src/backend/optimizer/path/README.uniquekey | 131 +++
src/backend/optimizer/path/allpaths.c | 10 +
src/backend/optimizer/path/joinpath.c | 9 +-
src/backend/optimizer/path/joinrels.c | 2 +
src/backend/optimizer/path/pathkeys.c | 3 +-
src/backend/optimizer/path/uniquekeys.c | 1134 +++++++++++++++++++
src/backend/optimizer/plan/planner.c | 15 +-
src/backend/optimizer/prep/prepunion.c | 2 +
src/backend/optimizer/util/appendinfo.c | 44 +
src/backend/optimizer/util/inherit.c | 18 +-
src/include/nodes/makefuncs.h | 3 +
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 29 +-
src/include/nodes/pg_list.h | 1 +
src/include/optimizer/appendinfo.h | 3 +
src/include/optimizer/optimizer.h | 2 +
src/include/optimizer/paths.h | 43 +
23 files changed, 1506 insertions(+), 25 deletions(-)
create mode 100644 src/backend/optimizer/path/README.uniquekey
create mode 100644 src/backend/optimizer/path/uniquekeys.c
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 456d563f34..fa927a3044 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2323,6 +2323,16 @@ _copyPathKey(const PathKey *from)
return newnode;
}
+static UniqueKey *
+_copyUniqueKey(const UniqueKey *from)
+{
+ UniqueKey *newnode = makeNode(UniqueKey);
+
+ COPY_NODE_FIELD(exprs);
+ COPY_SCALAR_FIELD(multi_nullvals);
+
+ return newnode;
+}
/*
* _copyRestrictInfo
*/
@@ -5331,6 +5341,9 @@ copyObjectImpl(const void *from)
case T_PathKey:
retval = _copyPathKey(from);
break;
+ case T_UniqueKey:
+ retval = _copyUniqueKey(from);
+ break;
case T_RestrictInfo:
retval = _copyRestrictInfo(from);
break;
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..83f9d37dc5 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -714,6 +714,37 @@ list_member_oid(const List *list, Oid datum)
return false;
}
+/*
+ * return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+ const ListCell *lc1, *lc2;
+
+ Assert(IsPointerList(members));
+ Assert(IsPointerList(target));
+ check_list_invariants(members);
+ check_list_invariants(target);
+
+ foreach(lc1, members)
+ {
+ bool found = false;
+ foreach(lc2, target)
+ {
+ if (equal(lfirst(lc1), lfirst(lc2)))
+ {
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ return false;
+ }
+ return true;
+}
+
/*
* Delete the n'th cell (counting from 0) in list.
*
diff --git a/src/backend/nodes/makefuncs.c b/src/backend/nodes/makefuncs.c
index 822395625b..2295a95d97 100644
--- a/src/backend/nodes/makefuncs.c
+++ b/src/backend/nodes/makefuncs.c
@@ -817,3 +817,16 @@ makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols)
v->va_cols = va_cols;
return v;
}
+
+
+/*
+ * makeUniqueKey
+ */
+UniqueKey*
+makeUniqueKey(List *exprs, bool multi_nullvals)
+{
+ UniqueKey * ukey = makeNode(UniqueKey);
+ ukey->exprs = exprs;
+ ukey->multi_nullvals = multi_nullvals;
+ return ukey;
+}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c0bf27d28b..2369d26c8c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2515,6 +2515,14 @@ _outPathKey(StringInfo str, const PathKey *node)
WRITE_BOOL_FIELD(pk_nulls_first);
}
+static void
+_outUniqueKey(StringInfo str, const UniqueKey *node)
+{
+ WRITE_NODE_TYPE("UNIQUEKEY");
+ WRITE_NODE_FIELD(exprs);
+ WRITE_BOOL_FIELD(multi_nullvals);
+}
+
static void
_outPathTarget(StringInfo str, const PathTarget *node)
{
@@ -4299,6 +4307,9 @@ outNode(StringInfo str, const void *obj)
case T_PathKey:
_outPathKey(str, obj);
break;
+ case T_UniqueKey:
+ _outUniqueKey(str, obj);
+ break;
case T_PathTarget:
_outPathTarget(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3f68f7c18d..7b1a2a397c 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -492,6 +492,14 @@ _readSetOperationStmt(void)
READ_DONE();
}
+static UniqueKey *
+_readUniqueKey(void)
+{
+ READ_LOCALS(UniqueKey);
+ READ_NODE_FIELD(exprs);
+ READ_BOOL_FIELD(multi_nullvals);
+ READ_DONE();
+}
/*
* Stuff from primnodes.h.
@@ -2746,6 +2754,8 @@ parseNodeString(void)
return_value = _readCommonTableExpr();
else if (MATCH("SETOPERATIONSTMT", 16))
return_value = _readSetOperationStmt();
+ else if (MATCH("UNIQUEKEY", 9))
+ return_value = _readUniqueKey();
else if (MATCH("ALIAS", 5))
return_value = _readAlias();
else if (MATCH("RANGEVAR", 8))
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
joinpath.o \
joinrels.o \
pathkeys.o \
- tidpath.o
+ tidpath.o \
+ uniquekeys.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/README.uniquekey b/src/backend/optimizer/path/README.uniquekey
new file mode 100644
index 0000000000..5eac761995
--- /dev/null
+++ b/src/backend/optimizer/path/README.uniquekey
@@ -0,0 +1,131 @@
+1. What is UniqueKey?
+We can think of a UniqueKey as a set of exprs for a RelOptInfo which we are sure
+doesn't yield the same result among all the rows. The simplest UniqueKey
+form is a primary key.
+
+However, we define the UniqueKey as below.
+
+typedef struct UniqueKey
+{
+ NodeTag type;
+ List *exprs;
+ bool multi_nullvals;
+} UniqueKey;
+
+exprs is a list of exprs which are unique on the current RelOptInfo. exprs = NIL
+is a special case of UniqueKey, which means there is only one row in that
+relation; it has stronger semantics than the others. For example, in
+SELECT uk FROM t; uk is a normal unique key and may have different values. In
+SELECT colx FROM t WHERE uk = const; colx is unique AND we have only 1 value.
+This field can be used for innerrel_is_unique. This logic is handled specially
+in the add_uniquekey_for_onerow function.
+
+multi_nullvals: true means multiple null values may exist in these exprs, so
+uniqueness is not guaranteed in this case. This field is necessary for
+remove_useless_joins & reduce_unique_semijoins, where we don't mind these
+duplicated NULL values. It is set to true in 2 cases. One is a unique key
+from a unique index where the related column is nullable. The other one is for
+outer joins; see populate_joinrel_uniquekeys for details.
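+
+For example (an illustrative sketch; the table and index names are made up):
+given CREATE TABLE t (a int, b int); CREATE UNIQUE INDEX t_a ON t(a);
+column a has no NOT NULL constraint, so the table may contain many rows with
+a = NULL, and the UniqueKey (a) is built with multi_nullvals = true. Such a
+key can still be used for join removal, but not to elide
+SELECT DISTINCT a FROM t;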
+
+
+The UniqueKey can be used in at least the following cases:
+1. remove_useless_joins.
+2. reduce_unique_semijoins.
+3. remove the Distinct node if the distinct clause is unique.
+4. remove the Agg node if the group by clause is unique.
+5. Index Skip Scan (WIP)
+6. Aggregation Push Down without 2-phase aggregation if the join can't
+   duplicate the aggregated rows. (work-in-progress feature)
+
+2. How is it maintained?
+
+We have a set of populate_xxx_uniquekeys functions to maintain the uniquekeys in
+various cases. xxx includes baserel, joinrel, partitionedrel, distinctrel,
+groupedrel, unionrel. We also need to convert the uniquekeys from a subquery
+to the outer relation, which is what convert_subquery_uniquekeys does.
+
+1. The first part is about the baserel. We handle 3 cases. Suppose we have a
+unique index on (a, b).
+
+1. SELECT a, b FROM t. UniqueKey (a, b)
+2. SELECT a FROM t WHERE b = 1; UniqueKey (a)
+3. SELECT .. FROM t WHERE a = 1 AND b = 1; UniqueKey (NIL). The onerow case:
+   every column is unique.
+
+2. The next part is the joinrel. This part is the most error-prone; we simplified
+the rules as below (see the examples after this list):
+1. If the relation's UniqueKey can't be duplicated after the join, then it will
+   still be valid for the join rel. The function we use here is
+   innerrel_keeps_unique. The basic idea is innerrel.any_col = outerrel.uk.
+
+2. If the UniqueKey can't stay valid via rule 1, the combination of the
+   UniqueKeys from both sides is valid for sure. We can prove this as follows:
+   if the unique exprs from rel1 are duplicated by rel2, the duplicated rows
+   must contain different unique exprs from rel2.
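+
+For illustration (made-up tables t1 and t2 with unique indexes on t1.pk1 and
+t2.pk2, plus ordinary columns t1.fk, t1.y, t2.x):
+
+Rule 1: SELECT t1.pk1 FROM t1 JOIN t2 ON t2.pk2 = t1.fk;
+        t1's UniqueKey (pk1) survives the join because each t1 row can match
+        at most one t2 row (t2.pk2 is unique), so t1's rows are not duplicated.
+
+Rule 2: SELECT t1.pk1, t2.pk2 FROM t1 JOIN t2 ON t2.x = t1.y;
+        neither key survives on its own, but the combination (pk1, pk2) is
+        still unique in the join result.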
+
+More considerations about onerow:
+1. If a relation has one row and it can't be duplicated, it may still contain
+   multi_nullvals after an outer join.
+2. If either UniqueKey can be duplicated after the join, we can get one row
+   only when both sides are one row AND there is no outer join.
+3. Whenever the onerow UniqueKey is not valid any more, we need to convert the
+   one-row UniqueKey to a normal unique key since we don't store exprs for a
+   one-row relation. get_exprs_from_uniquekey is used here.
+
+
+More considerations about multi_nullvals after a join:
+1. If the original UniqueKey has multi_nullvals, the final UniqueKey will have
+   multi_nullvals in any case.
+2. Even if a UniqueKey doesn't have multi_nullvals originally, it may gain
+   multi_nullvals after an outer join.
+
+
+3. When we come to a subquery, we need convert_subquery_uniquekeys just like
+convert_subquery_pathkeys. Only a UniqueKey inside the subquery that is referenced
+as a Var in the outer relation will be reused. The relationship between the
+outerrel.Var and the subquery.exprs is built with outerrel->subroot->processed_tlist.
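+
+For example (illustrative, made-up names): in
+SELECT * FROM (SELECT DISTINCT a, b FROM t) sub,
+the subquery's final rel carries UniqueKey (a, b); both columns appear in the
+subquery's target list, so the outer rel "sub" gets UniqueKey (sub.a, sub.b).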
+
+
+4. As for SRFs in the target list, they break the uniqueness of a UniqueKey.
+However this is handled in adjust_paths_for_srfs, which happens after
+query_planner, so we maintain the UniqueKey until there and reset it to NIL at
+that place. This can't help the distinct/group by elimination cases but probably
+helps in some other cases, like reduce_unique_semijoins/remove_useless_joins,
+and it is semantically correct.
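+
+For example (illustrative, made-up names): in
+SELECT uk, generate_series(1, 3) FROM t,
+the SRF turns each input row into several output rows, so the UniqueKey (uk)
+is reset to NIL when adjust_paths_for_srfs runs.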
+
+
+5. As for inheritance tables, we first maintain the UniqueKey on the childrel as
+well. But for partitioned tables we need to maintain 2 different kinds of
+UniqueKey: 1). UniqueKey on the parent relation 2). UniqueKey on the child
+relation for partition-wise queries.
+
+Example:
+CREATE TABLE p (a int not null, b int not null) partition by list (a);
+CREATE TABLE p0 partition of p for values in (1);
+CREATE TABLE p1 partition of p for values in (2);
+
+create unique index p0_b on p0(b);
+create unique index p1_b on p1(b);
+
+Now b is only unique at the partition level, so the distinct can't be removed in
+the following case: SELECT DISTINCT b FROM p;
+
+Another example is SELECT DISTINCT a, b FROM p WHERE a = 1; Since only one
+partition is chosen, the UniqueKey on the child relation is the same as the
+UniqueKey on the parent relation.
+
+Another usage of UniqueKey at the partition level is that it can be helpful for
+partition-wise join.
+
+As for the UniqueKey at the parent table level, it comes about in 2 different ways:
+1). the UniqueKey is derived from a unique index as usual, but the index must be
+the same in all the related child relations and the unique index must contain
+the partition key. Example:
+
+CREATE UNIQUE INDEX p_ab ON p(a, b); -- where a is the partition key.
+
+-- Query
+SELECT a, b FROM p;  -- (a, b) is a UniqueKey of p.
+
+2). If the parent relation has only one childrel, the UniqueKey on the childrel
+    is the UniqueKey on the parent as well.
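+
+Example for 2) (illustrative, using the tables above): SELECT b FROM p WHERE a = 2
+prunes everything except p1, so p1's UniqueKey (b) becomes a UniqueKey of p.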
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index c851995618..ac95015b56 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -580,6 +580,12 @@ set_plain_rel_size(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
*/
check_index_predicates(root, rel);
+ /*
+ * Now that we've marked which partial indexes are suitable, we can now
+ * build the relation's unique keys.
+ */
+ populate_baserel_uniquekeys(root, rel, rel->indexlist);
+
/* Mark rel with estimated output rows, width, etc */
set_baserel_size_estimates(root, rel);
}
@@ -1298,6 +1304,8 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
/* Add paths to the append relation. */
add_paths_to_append_rel(root, rel, live_childrels);
+ if (IS_PARTITIONED_REL(rel))
+ populate_partitionedrel_uniquekeys(root, rel, live_childrels);
}
@@ -2373,6 +2381,8 @@ set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
pathkeys, required_outer));
}
+ convert_subquery_uniquekeys(root, rel, sub_final_rel);
+
/* If outer rel allows parallelism, do same for partial paths. */
if (rel->consider_parallel && bms_is_empty(required_outer))
{
diff --git a/src/backend/optimizer/path/joinpath.c b/src/backend/optimizer/path/joinpath.c
index f96fc9fd28..5ea9d7ffde 100644
--- a/src/backend/optimizer/path/joinpath.c
+++ b/src/backend/optimizer/path/joinpath.c
@@ -77,13 +77,6 @@ static void consider_parallel_mergejoin(PlannerInfo *root,
static void hash_inner_and_outer(PlannerInfo *root, RelOptInfo *joinrel,
RelOptInfo *outerrel, RelOptInfo *innerrel,
JoinType jointype, JoinPathExtraData *extra);
-static List *select_mergejoin_clauses(PlannerInfo *root,
- RelOptInfo *joinrel,
- RelOptInfo *outerrel,
- RelOptInfo *innerrel,
- List *restrictlist,
- JoinType jointype,
- bool *mergejoin_allowed);
static void generate_mergejoin_paths(PlannerInfo *root,
RelOptInfo *joinrel,
RelOptInfo *innerrel,
@@ -2218,7 +2211,7 @@ hash_inner_and_outer(PlannerInfo *root,
* if it is mergejoinable and involves vars from the two sub-relations
* currently of interest.
*/
-static List *
+List *
select_mergejoin_clauses(PlannerInfo *root,
RelOptInfo *joinrel,
RelOptInfo *outerrel,
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 9da3ff2f9a..789c76af9d 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -924,6 +924,8 @@ populate_joinrel_with_paths(PlannerInfo *root, RelOptInfo *rel1,
/* Apply partitionwise join technique, if possible. */
try_partitionwise_join(root, rel1, rel2, joinrel, sjinfo, restrictlist);
+
+ populate_joinrel_uniquekeys(root, joinrel, rel1, rel2, restrictlist, sjinfo->jointype);
}
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 86a35cdef1..9022c77dac 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -33,7 +33,6 @@ static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
int partkeycol);
-static Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
static bool right_merge_direction(PlannerInfo *root, PathKey *pathkey);
@@ -1035,7 +1034,7 @@ convert_subquery_pathkeys(PlannerInfo *root, RelOptInfo *rel,
* We need this to ensure that we don't return pathkeys describing values
* that are unavailable above the level of the subquery scan.
*/
-static Var *
+Var *
find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle)
{
ListCell *lc;
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
new file mode 100644
index 0000000000..ca40c40858
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -0,0 +1,1134 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekeys.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekeys.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
+#include "optimizer/pathnode.h"
+#include "optimizer/paths.h"
+#include "optimizer/appendinfo.h"
+#include "optimizer/optimizer.h"
+#include "optimizer/tlist.h"
+#include "rewrite/rewriteManip.h"
+
+
+/*
+ * This struct is used to help populate_joinrel_uniquekeys.
+ *
+ * added_to_joinrel is true if a uniquekey (from outerrel or innerrel)
+ * has been added to joinrel.
+ * useful is true if the exprs of the uniquekey still appear in joinrel.
+ */
+typedef struct UniqueKeyContextData
+{
+ UniqueKey *uniquekey;
+ bool added_to_joinrel;
+ bool useful;
+} *UniqueKeyContext;
+
+static List *initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel);
+static bool innerrel_keeps_unique(PlannerInfo *root,
+ RelOptInfo *outerrel,
+ RelOptInfo *innerrel,
+ List *restrictlist,
+ bool reverse);
+
+static List *get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+ List *const_exprs,
+ List *const_expr_opfamilies,
+ Bitmapset *used_varattrs,
+ bool *useful,
+ bool *multi_nullvals);
+static List *get_exprs_from_uniquekey(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *rel1,
+ UniqueKey *ukey);
+static void add_uniquekey_for_onerow(RelOptInfo *rel);
+static bool add_combined_uniquekey(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ UniqueKey *outer_ukey,
+ UniqueKey *inner_ukey,
+ JoinType jointype);
+
+/* Used for unique indexes checking for partitioned table */
+static bool index_constains_partkey(RelOptInfo *partrel, IndexOptInfo *ind);
+static IndexOptInfo *simple_copy_indexinfo_to_parent(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ IndexOptInfo *from);
+static bool simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2);
+static void adjust_partition_unique_indexlist(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ RelOptInfo *childrel,
+ List **global_unique_index);
+
+/* Helper function for grouped relation and distinct relation. */
+static void add_uniquekey_from_sortgroups(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *sortgroups);
+
+/*
+ * populate_baserel_uniquekeys
+ * Populate 'baserel' uniquekeys list by looking at the rel's unique index
+ * and baserestrictinfo
+ */
+void
+populate_baserel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *baserel,
+ List *indexlist)
+{
+ ListCell *lc;
+ List *matched_uniq_indexes = NIL;
+
+ /* Attrs appears in rel->reltarget->exprs. */
+ Bitmapset *used_attrs = NULL;
+
+ List *const_exprs = NIL;
+ List *expr_opfamilies = NIL;
+
+ Assert(baserel->rtekind == RTE_RELATION);
+
+ foreach(lc, indexlist)
+ {
+ IndexOptInfo *ind = (IndexOptInfo *) lfirst(lc);
+ if (!ind->unique || !ind->immediate ||
+ (ind->indpred != NIL && !ind->predOK))
+ continue;
+ matched_uniq_indexes = lappend(matched_uniq_indexes, ind);
+ }
+
+ if (matched_uniq_indexes == NIL)
+ return;
+
+ /* Check which attrs is used in baserel->reltarget */
+ pull_varattnos((Node *)baserel->reltarget->exprs, baserel->relid, &used_attrs);
+
+ /* Check which attrno is used at a mergeable const filter */
+ foreach(lc, baserel->baserestrictinfo)
+ {
+ RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+
+ if (rinfo->mergeopfamilies == NIL)
+ continue;
+
+ if (bms_is_empty(rinfo->left_relids))
+ {
+ const_exprs = lappend(const_exprs, get_rightop(rinfo->clause));
+ }
+ else if (bms_is_empty(rinfo->right_relids))
+ {
+ const_exprs = lappend(const_exprs, get_leftop(rinfo->clause));
+ }
+ else
+ continue;
+
+ expr_opfamilies = lappend(expr_opfamilies, rinfo->mergeopfamilies);
+ }
+
+ foreach(lc, matched_uniq_indexes)
+ {
+ bool multi_nullvals, useful;
+ List *exprs = get_exprs_from_uniqueindex(lfirst_node(IndexOptInfo, lc),
+ const_exprs,
+ expr_opfamilies,
+ used_attrs,
+ &useful,
+ &multi_nullvals);
+ if (useful)
+ {
+ if (exprs == NIL)
+ {
+ /* All the columns in Unique Index matched with a restrictinfo */
+ add_uniquekey_for_onerow(baserel);
+ return;
+ }
+ baserel->uniquekeys = lappend(baserel->uniquekeys,
+ makeUniqueKey(exprs, multi_nullvals));
+ }
+ }
+}
+
+
+/*
+ * populate_partitionedrel_uniquekeys
+ * The UniqueKey on a partitioned rel comes from 2 cases:
+ * 1). Only one partition is involved in this query; the unique key can be
+ * copied to the parent rel from the childrel.
+ * 2). There are some unique indexes which include the partition key and exist
+ * in all the related partitions.
+ * We don't bother with rule 2 if we hit rule 1.
+ */
+
+void
+populate_partitionedrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *childrels)
+{
+ ListCell *lc;
+ List *global_uniq_indexlist = NIL;
+ RelOptInfo *childrel;
+ bool is_first = true;
+
+ Assert(IS_PARTITIONED_REL(rel));
+
+ if (childrels == NIL)
+ return;
+
+ /*
+ * If there is only one partition used in this query, the UniqueKey on the
+ * childrel is still valid at the parent level, but we need to convert the
+ * format from the child expr to the parent expr.
+ */
+ if (list_length(childrels) == 1)
+ {
+ /* Check for Rule 1 */
+ RelOptInfo *childrel = linitial_node(RelOptInfo, childrels);
+ ListCell *lc;
+ Assert(childrel->reloptkind == RELOPT_OTHER_MEMBER_REL);
+ if (relation_is_onerow(childrel))
+ {
+ add_uniquekey_for_onerow(rel);
+ return;
+ }
+
+ foreach(lc, childrel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ AppendRelInfo *appinfo = find_appinfo_by_child(root, childrel->relid);
+ List *parent_exprs = NIL;
+ bool can_reuse = true;
+ ListCell *lc2;
+ foreach(lc2, ukey->exprs)
+ {
+ Var *var = (Var *)lfirst(lc2);
+ /*
+ * If the expr is an expression rather than a plain Var, it is hard to build
+ * the expression in the parent, so ignore that case for now.
+ */
+ if(!IsA(var, Var))
+ {
+ can_reuse = false;
+ break;
+ }
+ /* Convert it to parent var */
+ parent_exprs = lappend(parent_exprs, find_parent_var(appinfo, var));
+ }
+ if (can_reuse)
+ rel->uniquekeys = lappend(rel->uniquekeys,
+ makeUniqueKey(parent_exprs,
+ ukey->multi_nullvals));
+ }
+ }
+ else
+ {
+ /* Check for rule 2 */
+ childrel = linitial_node(RelOptInfo, childrels);
+ foreach(lc, childrel->indexlist)
+ {
+ IndexOptInfo *ind = lfirst(lc);
+ IndexOptInfo *modified_index;
+ if (!ind->unique || !ind->immediate ||
+ (ind->indpred != NIL && !ind->predOK))
+ continue;
+
+ /*
+ * During simple_copy_indexinfo_to_parent, we need to convert each var from a
+ * child var to a parent var; an index on an expression is too complex to
+ * handle, so ignore it for now.
+ */
+ if (ind->indexprs != NIL)
+ continue;
+
+ modified_index = simple_copy_indexinfo_to_parent(root, rel, ind);
+ /*
+ * If the unique index doesn't contain partkey, then it is unique
+ * on this partition only, so it is useless for us.
+ */
+ if (!index_constains_partkey(rel, modified_index))
+ continue;
+
+ global_uniq_indexlist = lappend(global_uniq_indexlist, modified_index);
+ }
+
+ if (global_uniq_indexlist != NIL)
+ {
+ foreach(lc, childrels)
+ {
+ RelOptInfo *child = lfirst(lc);
+ if (is_first)
+ {
+ is_first = false;
+ continue;
+ }
+ adjust_partition_unique_indexlist(root, rel, child, &global_uniq_indexlist);
+ }
+ /* Now we have a list of unique indexes which are exactly the same on all
+ * childrels. Set the UniqueKey just as if it were a non-partitioned table.
+ */
+ populate_baserel_uniquekeys(root, rel, global_uniq_indexlist);
+ }
+ }
+}
+
+
+/*
+ * populate_distinctrel_uniquekeys
+ */
+void
+populate_distinctrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *inputrel,
+ RelOptInfo *distinctrel)
+{
+ /* The unique key before the distinct is still valid. */
+ distinctrel->uniquekeys = list_copy(inputrel->uniquekeys);
+ add_uniquekey_from_sortgroups(root, distinctrel, root->parse->distinctClause);
+}
+
+/*
+ * populate_grouprel_uniquekeys
+ */
+void
+populate_grouprel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *grouprel,
+ RelOptInfo *inputrel)
+
+{
+ Query *parse = root->parse;
+ bool input_ukey_added = false;
+ ListCell *lc;
+
+ if (relation_is_onerow(inputrel))
+ {
+ add_uniquekey_for_onerow(grouprel);
+ return;
+ }
+ if (parse->groupingSets)
+ return;
+
+ /* A Normal group by without grouping set. */
+ if (parse->groupClause)
+ {
+ /*
+ * Currently, even if the group by clause is unique already, we still have to
+ * create the grouprel when the query has an aggref. To keep the UniqueKey
+ * short, we check whether a UniqueKey of the input_rel is still valid; if so
+ * we reuse it.
+ */
+ foreach(lc, inputrel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ if (list_is_subset(ukey->exprs, grouprel->reltarget->exprs))
+ {
+ grouprel->uniquekeys = lappend(grouprel->uniquekeys,
+ ukey);
+ input_ukey_added = true;
+ }
+ }
+ if (!input_ukey_added)
+ /*
+ * The group by clause must be a superset of grouprel->reltarget->exprs except
+ * for the aggregation exprs, so if such exprs are unique already, there is no
+ * need to generate a new uniquekey for the group by exprs.
+ */
+ add_uniquekey_from_sortgroups(root,
+ grouprel,
+ root->parse->groupClause);
+ }
+ else
+ /* It has aggregation but without a group by, so only one row returned */
+ add_uniquekey_for_onerow(grouprel);
+}
+
+/*
+ * simple_copy_uniquekeys
+ * Using a function for this one-line code makes it easy to check where we
+ * simply copied the uniquekeys.
+ */
+void
+simple_copy_uniquekeys(RelOptInfo *oldrel,
+ RelOptInfo *newrel)
+{
+ newrel->uniquekeys = oldrel->uniquekeys;
+}
+
+/*
+ * populate_unionrel_uniquekeys
+ */
+void
+populate_unionrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *unionrel)
+{
+ ListCell *lc;
+ List *exprs = NIL;
+
+ Assert(unionrel->uniquekeys == NIL);
+
+ foreach(lc, unionrel->reltarget->exprs)
+ {
+ exprs = lappend(exprs, lfirst(lc));
+ }
+
+ if (exprs == NIL)
+ /* SQL: select union select; is valid, we need to handle it here. */
+ add_uniquekey_for_onerow(unionrel);
+ else
+ unionrel->uniquekeys = lappend(unionrel->uniquekeys,
+ makeUniqueKey(exprs,false));
+
+}
+
+/*
+ * populate_joinrel_uniquekeys
+ *
+ * Populate the uniquekeys for the joinrel. We check each input relation to see
+ * if its UniqueKey is still valid via innerrel_keeps_unique; if so, we add it to
+ * the joinrel. The multi_nullvals field will be changed to true for some outer
+ * join cases, and a one-row UniqueKey needs to be converted to a normal
+ * UniqueKey in those cases as well.
+ * For a uniquekey of either input rel which can't stay unique after the join, we
+ * still check whether the combination of UniqueKeys from both sides is useful
+ * for us; if yes, we add it to the joinrel as well.
+ */
+void
+populate_joinrel_uniquekeys(PlannerInfo *root, RelOptInfo *joinrel,
+ RelOptInfo *outerrel, RelOptInfo *innerrel,
+ List *restrictlist, JoinType jointype)
+{
+ ListCell *lc, *lc2;
+ List *clause_list = NIL;
+ List *outerrel_ukey_ctx;
+ List *innerrel_ukey_ctx;
+ bool inner_onerow, outer_onerow;
+ bool mergejoin_allowed;
+
+ /* Care about the outerrel relation only for SEMI/ANTI join */
+ if (jointype == JOIN_SEMI || jointype == JOIN_ANTI)
+ {
+ foreach(lc, outerrel->uniquekeys)
+ {
+ UniqueKey *uniquekey = lfirst_node(UniqueKey, lc);
+ if (list_is_subset(uniquekey->exprs, joinrel->reltarget->exprs))
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys, uniquekey);
+ }
+ return;
+ }
+
+ Assert(jointype == JOIN_LEFT || jointype == JOIN_FULL || jointype == JOIN_INNER);
+
+ /* Fast path */
+ if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+ return;
+
+ inner_onerow = relation_is_onerow(innerrel);
+ outer_onerow = relation_is_onerow(outerrel);
+
+ outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(outerrel);
+ innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(innerrel);
+
+ clause_list = select_mergejoin_clauses(root, joinrel, outerrel, innerrel,
+ restrictlist, jointype,
+ &mergejoin_allowed);
+
+ if (innerrel_keeps_unique(root, innerrel, outerrel, clause_list, true /* reverse */))
+ {
+ bool outer_impact = jointype == JOIN_FULL;
+ foreach(lc, outerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+ if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+ {
+ ctx->useful = false;
+ continue;
+ }
+
+ /*
+ * The outer relation has one row, and the unique key is not duplicated after
+ * the join, so the joinrel will still have one row unless jointype == JOIN_FULL.
+ */
+ if (outer_onerow && !outer_impact)
+ {
+ add_uniquekey_for_onerow(joinrel);
+ return;
+ }
+ else if (outer_onerow)
+ {
+ /*
+ * The onerow outerrel becomes multi rows and multi_nullvals
+ * will be changed to true. We also need to set the exprs correctly since it
+ * can't be NIL any more.
+ */
+ ListCell *lc2;
+ foreach(lc2, get_exprs_from_uniquekey(root, joinrel, outerrel, NULL))
+ {
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(lfirst(lc2), true));
+ }
+ }
+ else
+ {
+ if (!ctx->uniquekey->multi_nullvals && outer_impact)
+ /* Change multi_nullvals to true due to the full join. */
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(ctx->uniquekey->exprs, true));
+ else
+ /* Just reuse it */
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ ctx->uniquekey);
+ }
+ ctx->added_to_joinrel = true;
+ }
+ }
+
+ if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+ {
+ bool outer_impact = jointype == JOIN_FULL || jointype == JOIN_LEFT;
+
+ foreach(lc, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+
+ if (!list_is_subset(ctx->uniquekey->exprs, joinrel->reltarget->exprs))
+ {
+ ctx->useful = false;
+ continue;
+ }
+
+ if (inner_onerow && !outer_impact)
+ {
+ add_uniquekey_for_onerow(joinrel);
+ return;
+ }
+ else if (inner_onerow)
+ {
+ ListCell *lc2;
+ foreach(lc2, get_exprs_from_uniquekey(root, joinrel, innerrel, NULL))
+ {
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(lfirst(lc2), true));
+ }
+ }
+ else
+ {
+ if (!ctx->uniquekey->multi_nullvals && outer_impact)
+ /* Need to change multi_nullvals to true due to the outer join. */
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(ctx->uniquekey->exprs,
+ true));
+ else
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ ctx->uniquekey);
+
+ }
+ ctx->added_to_joinrel = true;
+ }
+ }
+
+ /*
+ * The combination of the UniqueKey from both sides is unique as well regardless
+ * of the join type, but don't bother to add it if a subset of it has already
+ * been added to the joinrel or it is not useful for the joinrel.
+ */
+ foreach(lc, outerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx1 = (UniqueKeyContext) lfirst(lc);
+ if (ctx1->added_to_joinrel || !ctx1->useful)
+ continue;
+ foreach(lc2, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx2 = (UniqueKeyContext) lfirst(lc2);
+ if (ctx2->added_to_joinrel || !ctx2->useful)
+ continue;
+ if (add_combined_uniquekey(root, joinrel, outerrel, innerrel,
+ ctx1->uniquekey, ctx2->uniquekey,
+ jointype))
+ /* If we set a onerow UniqueKey on the joinrel, we don't need the others. */
+ return;
+ }
+ }
+}
+
+
+/*
+ * convert_subquery_uniquekeys
+ *
+ * Convert the UniqueKeys in the subquery to the outer relation.
+ */
+void convert_subquery_uniquekeys(PlannerInfo *root,
+ RelOptInfo *currel,
+ RelOptInfo *sub_final_rel)
+{
+ ListCell *lc;
+
+ if (sub_final_rel->uniquekeys == NIL)
+ return;
+
+ if (relation_is_onerow(sub_final_rel))
+ {
+ add_uniquekey_for_onerow(currel);
+ return;
+ }
+
+ Assert(currel->subroot != NULL);
+
+ foreach(lc, sub_final_rel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ ListCell *lc;
+ List *exprs = NIL;
+ bool ukey_useful = true;
+
+ /* One row case is handled above */
+ Assert(ukey->exprs != NIL);
+ foreach(lc, ukey->exprs)
+ {
+ Var *var;
+ TargetEntry *tle = tlist_member(lfirst(lc),
+ currel->subroot->processed_tlist);
+ if (tle == NULL)
+ {
+ ukey_useful = false;
+ break;
+ }
+ var = find_var_for_subquery_tle(currel, tle);
+ if (var == NULL)
+ {
+ ukey_useful = false;
+ break;
+ }
+ exprs = lappend(exprs, var);
+ }
+
+ if (ukey_useful)
+ currel->uniquekeys = lappend(currel->uniquekeys,
+ makeUniqueKey(exprs,
+ ukey->multi_nullvals));
+
+ }
+}
+
+/*
+ * innerrel_keeps_unique
+ *
+ * Check if a unique key of the innerrel is still valid after the join. The
+ * innerrel's UniqueKey will still be valid if a mergeable clause of the form
+ * innerrel.any_col = outerrel.uniquekey exists in clause_list.
+ *
+ * Note: the clause_list must be a list of mergeable restrictinfos already.
+ */
+static bool
+innerrel_keeps_unique(PlannerInfo *root,
+ RelOptInfo *outerrel,
+ RelOptInfo *innerrel,
+ List *clause_list,
+ bool reverse)
+{
+ ListCell *lc, *lc2, *lc3;
+
+ if (outerrel->uniquekeys == NIL || innerrel->uniquekeys == NIL)
+ return false;
+
+ /* Check if the outerrel's uniquekey appears in a mergeable clause. */
+ foreach(lc, outerrel->uniquekeys)
+ {
+ List *outer_uq_exprs = lfirst_node(UniqueKey, lc)->exprs;
+ bool clauselist_matchs_all_exprs = true;
+ foreach(lc2, outer_uq_exprs)
+ {
+ Node *outer_uq_expr = lfirst(lc2);
+ bool find_uq_expr_in_clauselist = false;
+ foreach(lc3, clause_list)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc3);
+ Node *outer_expr;
+ if (reverse)
+ outer_expr = rinfo->outer_is_left ? get_rightop(rinfo->clause) : get_leftop(rinfo->clause);
+ else
+ outer_expr = rinfo->outer_is_left ? get_leftop(rinfo->clause) : get_rightop(rinfo->clause);
+ if (equal(outer_expr, outer_uq_expr))
+ {
+ find_uq_expr_in_clauselist = true;
+ break;
+ }
+ }
+ if (!find_uq_expr_in_clauselist)
+ {
+ /* No need to check the next exprs in the current uniquekey */
+ clauselist_matchs_all_exprs = false;
+ break;
+ }
+ }
+
+ if (clauselist_matchs_all_exprs)
+ return true;
+ }
+ return false;
+}
+
+
+/*
+ * relation_is_onerow
+ * Check if it is a one-row relation by checking UniqueKey.
+ */
+bool
+relation_is_onerow(RelOptInfo *rel)
+{
+ UniqueKey *ukey;
+ if (rel->uniquekeys == NIL)
+ return false;
+ ukey = linitial_node(UniqueKey, rel->uniquekeys);
+ return ukey->exprs == NIL && list_length(rel->uniquekeys) == 1;
+}
+
+/*
+ * relation_has_uniquekeys_for
+ * Returns true if we have proofs that 'rel' cannot return multiple rows with
+ * the same values in each of 'exprs'. Otherwise returns false.
+ */
+bool
+relation_has_uniquekeys_for(PlannerInfo *root, RelOptInfo *rel,
+ List *exprs, bool allow_multinulls)
+{
+ ListCell *lc;
+
+ /*
+ * For the UniqueKey onerow case, the uniquekey->exprs is empty as well,
+ * so we can't rely on list_is_subset to handle this special case.
+ */
+ if (exprs == NIL)
+ return false;
+
+ foreach(lc, rel->uniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc);
+ if (ukey->multi_nullvals && !allow_multinulls)
+ continue;
+ if (list_is_subset(ukey->exprs, exprs))
+ return true;
+ }
+ return false;
+}
+
+
+/*
+ * get_exprs_from_uniqueindex
+ *
+ * Return a list of exprs which is unique. Set *useful to false if this
+ * unique index is not useful for us.
+ */
+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+ List *const_exprs,
+ List *const_expr_opfamilies,
+ Bitmapset *used_varattrs,
+ bool *useful,
+ bool *multi_nullvals)
+{
+ List *exprs = NIL;
+ ListCell *indexpr_item;
+ int c = 0;
+
+ *useful = true;
+ *multi_nullvals = false;
+
+ indexpr_item = list_head(unique_index->indexprs);
+ for(c = 0; c < unique_index->ncolumns; c++)
+ {
+ int attr = unique_index->indexkeys[c];
+ Expr *expr;
+ bool matched_const = false;
+ ListCell *lc1, *lc2;
+
+ if(attr > 0)
+ {
+ expr = list_nth_node(TargetEntry, unique_index->indextlist, c)->expr;
+ }
+ else if (attr == 0)
+ {
+ /* Expression index */
+ expr = lfirst(indexpr_item);
+ indexpr_item = lnext(unique_index->indexprs, indexpr_item);
+ }
+ else /* attr < 0 */
+ {
+ /* Index on system column is not supported */
+ Assert(false);
+ }
+
+ /*
+ * Check the index_col = Const case with regard to the opfamily, to see
+ * if we can remove the index_col from the final UniqueKey->exprs.
+ */
+ forboth(lc1, const_exprs, lc2, const_expr_opfamilies)
+ {
+ if (list_member_oid((List *)lfirst(lc2), unique_index->opfamily[c])
+ && match_index_to_operand((Node *) lfirst(lc1), c, unique_index))
+ {
+ matched_const = true;
+ break;
+ }
+ }
+
+ if (matched_const)
+ continue;
+
+ /* Check if the indexed expr is used in rel */
+ if (attr > 0)
+ {
+ /*
+ * Normal indexed column; if the col is not used, then the index is useless
+ * for the uniquekey.
+ */
+ attr -= FirstLowInvalidHeapAttributeNumber;
+
+ if (!bms_is_member(attr, used_varattrs))
+ {
+ *useful = false;
+ break;
+ }
+ }
+ else if (!list_member(unique_index->rel->reltarget->exprs, expr))
+ {
+ /* Expression index but the expression is not used in rel */
+ *useful = false;
+ break;
+ }
+
+ /* check not null property. */
+ if (attr == 0)
+ {
+ /* We never know whether an expression yields null or not */
+ *multi_nullvals = true;
+ }
+ else if (!bms_is_member(attr, unique_index->rel->notnullattrs)
+ && !bms_is_member(0 - FirstLowInvalidHeapAttributeNumber,
+ unique_index->rel->notnullattrs))
+ {
+ *multi_nullvals = true;
+ }
+
+ exprs = lappend(exprs, expr);
+ }
+ return exprs;
+}
+
+
+/*
+ * add_uniquekey_for_onerow
+ * If we are sure that the relation only returns one row, then all the columns
+ * are unique. However we don't need to create a UniqueKey for every column; we
+ * just set exprs = NIL and overwrite all the other UniqueKeys on this RelOptInfo
+ * since this one has the strongest semantics.
+ */
+void
+add_uniquekey_for_onerow(RelOptInfo *rel)
+{
+ /*
+ * We overwrite the previous UniqueKey on purpose since this one has the
+ * strongest semantic.
+ */
+ rel->uniquekeys = list_make1(makeUniqueKey(NIL, false));
+}
+
+
+/*
+ * initililze_uniquecontext_for_joinrel
+ * Return a List of UniqueKeyContext for an inputrel
+ */
+static List *
+initililze_uniquecontext_for_joinrel(RelOptInfo *inputrel)
+{
+ List *res = NIL;
+ ListCell *lc;
+ foreach(lc, inputrel->uniquekeys)
+ {
+ UniqueKeyContext context;
+ context = palloc(sizeof(struct UniqueKeyContextData));
+ context->uniquekey = lfirst_node(UniqueKey, lc);
+ context->added_to_joinrel = false;
+ context->useful = true;
+ res = lappend(res, context);
+ }
+ return res;
+}
+
+
+/*
+ * get_exprs_from_uniquekey
+ * Unify the way of getting a List of expr lists from a one-row UniqueKey or a
+ * normal UniqueKey. For the onerow case, every expr in rel1 is a valid
+ * UniqueKey. Returns a List of expr lists.
+ *
+ * rel1: The relation which you want to get the exprs.
+ * ukey: The UniqueKey you want to get the exprs.
+ */
+static List *
+get_exprs_from_uniquekey(PlannerInfo *root, RelOptInfo *joinrel, RelOptInfo *rel1, UniqueKey *ukey)
+{
+ ListCell *lc;
+ bool onerow = rel1 != NULL && relation_is_onerow(rel1);
+
+ List *res = NIL;
+ Assert(onerow || ukey);
+ if (onerow)
+ {
+ /* Only care about the exprs which still exist in the joinrel */
+ foreach(lc, joinrel->reltarget->exprs)
+ {
+ Bitmapset *relids = pull_varnos(root, lfirst(lc));
+ if (bms_is_subset(relids, rel1->relids))
+ {
+ res = lappend(res, list_make1(lfirst(lc)));
+ }
+ }
+ }
+ else
+ {
+ res = list_make1(ukey->exprs);
+ }
+ return res;
+}
+
+/*
+ * Partitioned table Unique Keys.
+ * The partitioned table unique key is maintained as follows:
+ * 1. The index must be unique as usual.
+ * 2. The index must contain the partition key.
+ * 3. The index must exist on all the child rels; see simple_indexinfo_equal for
+ * how we compare them.
+ */
+
+/*
+ * index_constains_partkey
+ * Return true if the index contains the partition key.
+ */
+static bool
+index_constains_partkey(RelOptInfo *partrel, IndexOptInfo *ind)
+{
+ ListCell *lc;
+ int i;
+ Assert(IS_PARTITIONED_REL(partrel));
+ Assert(partrel->part_scheme->partnatts > 0);
+
+ for(i = 0; i < partrel->part_scheme->partnatts; i++)
+ {
+ Node *part_expr = linitial(partrel->partexprs[i]);
+ bool found_in_index = false;
+ foreach(lc, ind->indextlist)
+ {
+ Expr *index_expr = lfirst_node(TargetEntry, lc)->expr;
+ if (equal(index_expr, part_expr))
+ {
+ found_in_index = true;
+ break;
+ }
+ }
+ if (!found_in_index)
+ return false;
+ }
+ return true;
+}
+
+/*
+ * simple_indexinfo_equal
+ *
+ * Used to check if the 2 indexes are the same as each other. The indexes here
+ * are COPIED from the childrel with some tiny changes (see
+ * simple_copy_indexinfo_to_parent).
+ */
+static bool
+simple_indexinfo_equal(IndexOptInfo *ind1, IndexOptInfo *ind2)
+{
+ Size oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+ return ind1->ncolumns == ind2->ncolumns &&
+ ind1->unique == ind2->unique &&
+ memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+ memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+ memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+ memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+ equal(get_tlist_exprs(ind1->indextlist, true),
+ get_tlist_exprs(ind2->indextlist, true));
+}
+
+
+/*
+ * The below macros are used for simple_copy_indexinfo_to_parent, which is so
+ * customized that I don't want to put it in copyfuncs.c. So copy them here.
+ */
+#define COPY_POINTER_FIELD(fldname, sz) \
+ do { \
+ Size _size = (sz); \
+ newnode->fldname = palloc(_size); \
+ memcpy(newnode->fldname, from->fldname, _size); \
+ } while (0)
+
+#define COPY_NODE_FIELD(fldname) \
+ (newnode->fldname = copyObjectImpl(from->fldname))
+
+#define COPY_SCALAR_FIELD(fldname) \
+ (newnode->fldname = from->fldname)
+
+
+/*
+ * simple_copy_indexinfo_to_parent (from partition)
+ * Copy the IndexOptInfo from the child relation to the parent relation with
+ * some modifications, which is used to test:
+ * which is used to test:
+ * 1. If the same index exists in all the childrels.
+ * 2. If the parentrel->reltarget/basicrestrict info matches this index.
+ */
+static IndexOptInfo *
+simple_copy_indexinfo_to_parent(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ IndexOptInfo *from)
+{
+ IndexOptInfo *newnode = makeNode(IndexOptInfo);
+ AppendRelInfo *appinfo = find_appinfo_by_child(root, from->rel->relid);
+ ListCell *lc;
+ int idx = 0;
+
+ COPY_SCALAR_FIELD(ncolumns);
+ COPY_SCALAR_FIELD(nkeycolumns);
+ COPY_SCALAR_FIELD(unique);
+ COPY_SCALAR_FIELD(immediate);
+ /* We just need to know if it is NIL or not */
+ COPY_SCALAR_FIELD(indpred);
+ COPY_SCALAR_FIELD(predOK);
+ COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+ COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+ COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+ COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+ COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+ COPY_NODE_FIELD(indextlist);
+
+ /* Convert index exprs on child expr to expr on parent */
+ foreach(lc, newnode->indextlist)
+ {
+ TargetEntry *tle = lfirst_node(TargetEntry, lc);
+ /* Index on expression is ignored */
+ Assert(IsA(tle->expr, Var));
+ tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+ newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+ idx++;
+ }
+ newnode->rel = parentrel;
+ return newnode;
+}
+
+/*
+ * adjust_partition_unique_indexlist
+ *
+ * global_unique_indexes: At the beginning, it contains the copied & modified
+ * unique indexes from the first partition. Then check whether each index in it
+ * still exists in the following partitions. If not, remove it. At the end, it
+ * holds an index list which exists in all the partitions.
+ */
+static void
+adjust_partition_unique_indexlist(PlannerInfo *root,
+ RelOptInfo *parentrel,
+ RelOptInfo *childrel,
+ List **global_unique_indexes)
+{
+ ListCell *lc, *lc2;
+ foreach(lc, *global_unique_indexes)
+ {
+ IndexOptInfo *g_ind = lfirst_node(IndexOptInfo, lc);
+ bool found_in_child = false;
+
+ foreach(lc2, childrel->indexlist)
+ {
+ IndexOptInfo *p_ind = lfirst_node(IndexOptInfo, lc2);
+ IndexOptInfo *p_ind_copy;
+ if (!p_ind->unique || !p_ind->immediate ||
+ (p_ind->indpred != NIL && !p_ind->predOK))
+ continue;
+ p_ind_copy = simple_copy_indexinfo_to_parent(root, parentrel, p_ind);
+ if (simple_indexinfo_equal(p_ind_copy, g_ind))
+ {
+ found_in_child = true;
+ break;
+ }
+ }
+ if (!found_in_child)
+ /* The index doesn't exist in childrel, remove it from global_unique_indexes */
+ *global_unique_indexes = foreach_delete_current(*global_unique_indexes, lc);
+ }
+}
+
+/* Helper function for grouprel/distinctrel */
+static void
+add_uniquekey_from_sortgroups(PlannerInfo *root, RelOptInfo *rel, List *sortgroups)
+{
+ Query *parse = root->parse;
+ List *exprs;
+
+ /*
+ * XXX: If there are some vars which are not at the current query level, the
+ * semantics are imprecise; should we avoid it or not? levelsup = 1 is just a
+ * demo, maybe we need to check every level other than 0; if so, it looks like
+ * we have to write another pull_var walker.
+ */
+ List *upper_vars = pull_vars_of_level((Node*)sortgroups, 1);
+
+ if (upper_vars != NIL)
+ return;
+
+ exprs = get_sortgrouplist_exprs(sortgroups, parse->targetList);
+ rel->uniquekeys = lappend(rel->uniquekeys,
+ makeUniqueKey(exprs,
+ false /* sortgroupclause can't be multi_nullvals */));
+}
+
+
+/*
+ * add_combined_uniquekey
+ * The combination of both UniqueKeys is a valid UniqueKey for the joinrel no
+ * matter what the jointype is.
+ */
+bool
+add_combined_uniquekey(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outer_rel,
+ RelOptInfo *inner_rel,
+ UniqueKey *outer_ukey,
+ UniqueKey *inner_ukey,
+ JoinType jointype)
+{
+
+ ListCell *lc1, *lc2;
+
+ /* If either side has multi_nullvals or we have an outer join,
+ * the combined UniqueKey has multi_nullvals. */
+ bool multi_nullvals = outer_ukey->multi_nullvals ||
+ inner_ukey->multi_nullvals || IS_OUTER_JOIN(jointype);
+
+ /* The only case where we can get a onerow joinrel after the join */
+ if (relation_is_onerow(outer_rel)
+ && relation_is_onerow(inner_rel)
+ && jointype == JOIN_INNER)
+ {
+ add_uniquekey_for_onerow(joinrel);
+ return true;
+ }
+
+ foreach(lc1, get_exprs_from_uniquekey(root, joinrel, outer_rel, outer_ukey))
+ {
+ foreach(lc2, get_exprs_from_uniquekey(root, joinrel, inner_rel, inner_ukey))
+ {
+ List *exprs = list_concat_copy(lfirst_node(List, lc1), lfirst_node(List, lc2));
+ joinrel->uniquekeys = lappend(joinrel->uniquekeys,
+ makeUniqueKey(exprs,
+ multi_nullvals));
+ }
+ }
+ return false;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bd09f85aea..ff9f6df857 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -187,8 +187,7 @@ static void create_one_window_path(PlannerInfo *root,
PathTarget *output_target,
WindowFuncLists *wflists,
List *activeWindows);
-static RelOptInfo *create_distinct_paths(PlannerInfo *root,
- RelOptInfo *input_rel);
+static RelOptInfo *create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel);
static void create_partial_distinct_paths(PlannerInfo *root,
RelOptInfo *input_rel,
RelOptInfo *final_distinct_rel);
@@ -1866,6 +1865,8 @@ grouping_planner(PlannerInfo *root, double tuple_fraction)
add_path(final_rel, path);
}
+ simple_copy_uniquekeys(current_rel, final_rel);
+
/*
* Generate partial paths for final_rel, too, if outer query levels might
* be able to make use of them.
@@ -3380,6 +3381,8 @@ create_grouping_paths(PlannerInfo *root,
}
set_cheapest(grouped_rel);
+
+ populate_grouprel_uniquekeys(root, grouped_rel, input_rel);
return grouped_rel;
}
@@ -4102,7 +4105,7 @@ create_window_paths(PlannerInfo *root,
/* Now choose the best path(s) */
set_cheapest(window_rel);
-
+ simple_copy_uniquekeys(input_rel, window_rel);
return window_rel;
}
@@ -4291,6 +4294,7 @@ create_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel)
/* Now choose the best path(s) */
set_cheapest(distinct_rel);
+ populate_distinctrel_uniquekeys(root, input_rel, distinct_rel);
return distinct_rel;
}
@@ -4823,6 +4827,8 @@ create_ordered_paths(PlannerInfo *root,
*/
Assert(ordered_rel->pathlist != NIL);
+ simple_copy_uniquekeys(input_rel, ordered_rel);
+
return ordered_rel;
}
@@ -5700,6 +5706,9 @@ adjust_paths_for_srfs(PlannerInfo *root, RelOptInfo *rel,
if (list_length(targets) == 1)
return;
+ /* UniqueKey is not valid after handling the SRF. */
+ rel->uniquekeys = NIL;
+
/*
* Stack SRF-evaluation nodes atop each path for the rel.
*
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index f004fad1d9..6c604aff52 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -690,6 +690,8 @@ generate_union_paths(SetOperationStmt *op, PlannerInfo *root,
/* Undo effects of possibly forcing tuple_fraction to 0 */
root->tuple_fraction = save_fraction;
+ /* Add the UniqueKeys */
+ populate_unionrel_uniquekeys(root, result_rel);
return result_rel;
}
diff --git a/src/backend/optimizer/util/appendinfo.c b/src/backend/optimizer/util/appendinfo.c
index b8039c323b..f204c539a6 100644
--- a/src/backend/optimizer/util/appendinfo.c
+++ b/src/backend/optimizer/util/appendinfo.c
@@ -1000,3 +1000,47 @@ distribute_row_identity_vars(PlannerInfo *root)
}
}
}
+
+/*
+ * find_appinfo_by_child
+ * Find the AppendRelInfo whose child_relid matches child_index.
+ */
+AppendRelInfo *
+find_appinfo_by_child(PlannerInfo *root, Index child_index)
+{
+ ListCell *lc;
+ foreach(lc, root->append_rel_list)
+ {
+ AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
+ if (appinfo->child_relid == child_index)
+ return appinfo;
+ }
+ elog(ERROR, "parent relation can't be found");
+ return NULL;
+}
+
+/*
+ * find_parent_var
+ * Find the parent relation's Var corresponding to child_var, using
+ * appinfo->translated_vars.
+ */
+Var *
+find_parent_var(AppendRelInfo *appinfo, Var *child_var)
+{
+ ListCell *lc;
+ Var *res = NULL;
+ Index attno = 1;
+ foreach(lc, appinfo->translated_vars)
+ {
+ Node *child_node = lfirst(lc);
+ if (equal(child_node, child_var))
+ {
+ res = copyObject(child_var);
+ res->varattno = attno;
+ res->varno = appinfo->parent_relid;
+ }
+ attno++;
+ }
+ if (res == NULL)
+ elog(ERROR, "parent var can't be found.");
+ return res;
+}
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 8c5dc65947..9b49f12e43 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -797,6 +797,7 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
{
Node *onecq = (Node *) lfirst(lc2);
bool pseudoconstant;
+ RestrictInfo *child_rinfo;
/* check for pseudoconstant (no Vars or volatile functions) */
pseudoconstant =
@@ -808,14 +809,15 @@ apply_child_basequals(PlannerInfo *root, RelOptInfo *parentrel,
root->hasPseudoConstantQuals = true;
}
/* reconstitute RestrictInfo with appropriate properties */
- childquals = lappend(childquals,
- make_restrictinfo(root,
- (Expr *) onecq,
- rinfo->is_pushed_down,
- rinfo->outerjoin_delayed,
- pseudoconstant,
- rinfo->security_level,
- NULL, NULL, NULL));
+ child_rinfo = make_restrictinfo(root,
+ (Expr *) onecq,
+ rinfo->is_pushed_down,
+ rinfo->outerjoin_delayed,
+ pseudoconstant,
+ rinfo->security_level,
+ NULL, NULL, NULL);
+ child_rinfo->mergeopfamilies = rinfo->mergeopfamilies;
+ childquals = lappend(childquals, child_rinfo);
/* track minimum security level among child quals */
cq_min_security = Min(cq_min_security, rinfo->security_level);
}
diff --git a/src/include/nodes/makefuncs.h b/src/include/nodes/makefuncs.h
index fe173101d1..bfda94e907 100644
--- a/src/include/nodes/makefuncs.h
+++ b/src/include/nodes/makefuncs.h
@@ -16,6 +16,7 @@
#include "nodes/execnodes.h"
#include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
extern A_Expr *makeA_Expr(A_Expr_Kind kind, List *name,
@@ -106,4 +107,6 @@ extern GroupingSet *makeGroupingSet(GroupingSetKind kind, List *content, int loc
extern VacuumRelation *makeVacuumRelation(RangeVar *relation, Oid oid, List *va_cols);
+extern UniqueKey* makeUniqueKey(List *exprs, bool multi_nullvals);
+
#endif /* MAKEFUNC_H */
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 28cf5aefca..75a406b982 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -267,6 +267,7 @@ typedef enum NodeTag
T_EquivalenceMember,
T_PathKey,
T_PathTarget,
+ T_UniqueKey,
T_RestrictInfo,
T_IndexClause,
T_PlaceHolderVar,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 58b9ef71a7..57e80a47d2 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -748,6 +748,7 @@ typedef struct RelOptInfo
QualCost baserestrictcost; /* cost of evaluating the above */
Index baserestrict_min_security; /* min security_level found in
* baserestrictinfo */
+ List *uniquekeys; /* List of UniqueKey */
List *joininfo; /* RestrictInfo structures for join clauses
* involving this rel */
bool has_eclass_joins; /* T means joininfo is incomplete */
@@ -1083,6 +1084,28 @@ typedef enum VolatileFunctionStatus
VOLATILITY_NOVOLATILE
} VolatileFunctionStatus;
+/*
+ * UniqueKey
+ *
+ * Represents the unique properties held by a RelOptInfo.
+ *
+ * exprs is a list of exprs which is unique on current RelOptInfo. exprs = NIL
+ * is a special case of UniqueKey, which means there is only 1 row in that
+ * relation.
+ * multi_nullvals: true means multi null values may exist in these exprs, so the
+ * uniqueness is not guaranteed in this case. This field is necessary for
+ * remove_useless_join & reduce_unique_semijoins where we don't mind these
+ * duplicated NULL values. It is set to true for 2 cases. One is a unique key
+ * from a unique index but the related column is nullable. The other one is for
+ * outer join. see populate_joinrel_uniquekeys for detail.
+ */
+typedef struct UniqueKey
+{
+ NodeTag type;
+ List *exprs;
+ bool multi_nullvals;
+} UniqueKey;
+
/*
* PathTarget
*
@@ -2581,7 +2604,7 @@ typedef enum
*
* flags indicating what kinds of grouping are possible.
* partial_costs_set is true if the agg_partial_costs and agg_final_costs
- * have been initialized.
+ * have been initialized.
* agg_partial_costs gives partial aggregation costs.
* agg_final_costs gives finalization costs.
* target_parallel_safe is true if target is parallel safe.
@@ -2611,8 +2634,8 @@ typedef struct
* limit_tuples is an estimated bound on the number of output tuples,
* or -1 if no LIMIT or couldn't estimate.
* count_est and offset_est are the estimated values of the LIMIT and OFFSET
- * expressions computed by preprocess_limit() (see comments for
- * preprocess_limit() for more information).
+ * expressions computed by preprocess_limit() (see comments for
+ * preprocess_limit() for more information).
*/
typedef struct
{
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 2cb9d1371d..d1bfbb77c6 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -558,6 +558,7 @@ extern bool list_member_ptr(const List *list, const void *datum);
extern bool list_member_int(const List *list, int datum);
extern bool list_member_oid(const List *list, Oid datum);
+extern bool list_is_subset(const List *members, const List *target);
extern pg_nodiscard List *list_delete(List *list, void *datum);
extern pg_nodiscard List *list_delete_ptr(List *list, void *datum);
extern pg_nodiscard List *list_delete_int(List *list, int datum);
diff --git a/src/include/optimizer/appendinfo.h b/src/include/optimizer/appendinfo.h
index fc808dcd27..b2d9df2df8 100644
--- a/src/include/optimizer/appendinfo.h
+++ b/src/include/optimizer/appendinfo.h
@@ -47,4 +47,7 @@ extern void add_row_identity_columns(PlannerInfo *root, Index rtindex,
Relation target_relation);
extern void distribute_row_identity_vars(PlannerInfo *root);
+extern AppendRelInfo *find_appinfo_by_child(PlannerInfo *root, Index child_index);
+extern Var *find_parent_var(AppendRelInfo *appinfo, Var *child_var);
+
#endif /* APPENDINFO_H */
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 6b8ee0c69f..2851d04de9 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -23,6 +23,7 @@
#define OPTIMIZER_H
#include "nodes/parsenodes.h"
+#include "nodes/pathnodes.h"
/* Test if an expression node represents a SRF call. Beware multiple eval! */
#define IS_SRF_CALL(node) \
@@ -171,6 +172,7 @@ extern TargetEntry *get_sortgroupref_tle(Index sortref,
List *targetList);
extern TargetEntry *get_sortgroupclause_tle(SortGroupClause *sgClause,
List *targetList);
+extern Var *find_var_for_subquery_tle(RelOptInfo *rel, TargetEntry *tle);
extern Node *get_sortgroupclause_expr(SortGroupClause *sgClause,
List *targetList);
extern List *get_sortgrouplist_exprs(List *sgClauses,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0c3a0b90c8..28db3c59cd 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -254,5 +254,48 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
int strategy, bool nulls_first);
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern List *select_mergejoin_clauses(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *outerrel,
+ RelOptInfo *innerrel,
+ List *restrictlist,
+ JoinType jointype,
+ bool *mergejoin_allowed);
+
+/*
+ * uniquekeys.c
+ * Utilities for matching and building unique keys
+ */
+extern void populate_baserel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *baserel,
+ List* unique_index_list);
+extern void populate_partitionedrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *childrels);
+extern void populate_distinctrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *inputrel,
+ RelOptInfo *distinctrel);
+extern void populate_grouprel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *grouprel,
+ RelOptInfo *inputrel);
+extern void populate_unionrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *unionrel);
+extern void simple_copy_uniquekeys(RelOptInfo *oldrel,
+ RelOptInfo *newrel);
+extern void convert_subquery_uniquekeys(PlannerInfo *root,
+ RelOptInfo *currel,
+ RelOptInfo *sub_final_rel);
+extern void populate_joinrel_uniquekeys(PlannerInfo *root,
+ RelOptInfo *joinrel,
+ RelOptInfo *rel1,
+ RelOptInfo *rel2,
+ List *restrictlist,
+ JoinType jointype);
+
+extern bool relation_has_uniquekeys_for(PlannerInfo *root,
+ RelOptInfo *rel,
+ List *exprs,
+ bool allow_multinulls);
+extern bool relation_is_onerow(RelOptInfo *rel);
#endif /* PATHS_H */
--
2.33.1
Attachment: v3-0001-Extend-UniqueKeys.patch (application/octet-stream)
From 9e247a70a72c42a4ae2c69ad4b8483eb477eacc2 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 8 Jun 2020 20:33:56 +0200
Subject: [PATCH 3/5] Extend UniqueKeys
Prepares the index skip scan implementation using UniqueKeys. Allows
specifying which "requested" keys should be unique, and adds them to the
necessary Paths to make them useful later.
Proposed by David Rowley; contains a few bits from the previous version by
Jesper Pedersen.
---
src/backend/optimizer/path/pathkeys.c | 59 +++++++++++++++++++++++
src/backend/optimizer/path/uniquekeys.c | 63 +++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 36 +++++++++++++-
src/backend/optimizer/util/pathnode.c | 32 +++++++++----
src/include/nodes/pathnodes.h | 5 ++
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 8 ++++
7 files changed, 194 insertions(+), 10 deletions(-)
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 9022c77dac..9b7cdce350 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -95,6 +96,29 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Checks if the new pathkey's equivalence class is the same as that of
+ * any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already in the list, then not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return false;
+ }
+
+ return true;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -1151,6 +1175,41 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_uniquekeys
+ * Generate a pathkeys list to be used for uniquekey clauses
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, sortclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ if (pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
index ca40c40858..ab4b1d1939 100644
--- a/src/backend/optimizer/path/uniquekeys.c
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -1132,3 +1132,66 @@ add_combined_uniquekey(PlannerInfo *root,
}
return false;
}
+
+List*
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+ List *result = NIL;
+ List *sortkeys;
+ ListCell *l;
+ List *exprs = NIL;
+
+ sortkeys = make_pathkeys_for_uniquekeys(root,
+ sortclauses,
+ root->processed_tlist);
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, sortkeys)
+ {
+ PathKey *pathkey = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = pathkey->pk_eclass;
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
+ if (EC_MUST_BE_REDUNDANT(ec))
+ continue;
+ exprs = lappend(exprs, mem->em_expr);
+ }
+
+ result = lappend(result, makeUniqueKey(exprs, false));
+
+ return result;
+}
+
+bool
+query_has_uniquekeys_for(PlannerInfo *root, List *pathuniquekeys,
+ bool allow_multinulls)
+{
+ ListCell *lc;
+ ListCell *lc2;
+
+ /* root->query_uniquekeys are the unique keys requested by the query's
+ * DISTINCT clause; pathuniquekeys are the unique keys of the current path.
+ * Every requested query_uniquekey must be satisfied by some pathuniquekey.
+ */
+ foreach(lc, root->query_uniquekeys)
+ {
+ UniqueKey *query_ukey = lfirst_node(UniqueKey, lc);
+ bool satisfied = false;
+ foreach(lc2, pathuniquekeys)
+ {
+ UniqueKey *ukey = lfirst_node(UniqueKey, lc2);
+ if (ukey->multi_nullvals && !allow_multinulls)
+ continue;
+ if (list_length(ukey->exprs) == 0 &&
+ list_length(query_ukey->exprs) != 0)
+ continue;
+ if (list_is_subset(ukey->exprs, query_ukey->exprs))
+ {
+ satisfied = true;
+ break;
+ }
+ }
+ if (!satisfied)
+ return false;
+ }
+ return true;
+}
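
For concreteness, a hedged SQL sketch of what the subset test above buys us
(the table and index are invented for illustration, not taken from the patch):
a path that is known to be unique on {a} is trivially also unique on {a, b},
so it can satisfy a wider DISTINCT request.

CREATE TABLE t1 (a int, b int, c int);
CREATE INDEX ON t1 (a, b);

-- query_uniquekeys is built from the DISTINCT clause: {a, b}.
-- A path whose unique keys are {a} passes the subset check above,
-- so it can be added to the distinct rel as-is.
SELECT DISTINCT a, b FROM t1;
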
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index ff9f6df857..5ae2475400 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3082,12 +3082,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
*/
if (qp_extra->groupClause &&
grouping_is_sortable(qp_extra->groupClause))
+ {
root->group_pathkeys =
make_pathkeys_for_sortclauses(root,
qp_extra->groupClause,
tlist);
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
+ {
root->group_pathkeys = NIL;
+ root->query_uniquekeys = NIL;
+ }
/* We consider only the first (bottom) window in pathkeys logic */
if (activeWindows != NIL)
@@ -4497,13 +4503,19 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
Path *path = (Path *) lfirst(lc);
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
- {
add_path(distinct_rel, (Path *)
create_upper_unique_path(root, distinct_rel,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
- }
+ }
+
+ foreach(lc, input_rel->unique_pathlist)
+ {
+ Path *path = (Path *) lfirst(lc);
+
+ if (query_has_uniquekeys_for(root, needed_pathkeys, false))
+ add_path(distinct_rel, path);
}
/* For explicit-sort case, always use the more rigorous clause */
@@ -7118,6 +7130,26 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
}
+ foreach(lc, rel->unique_pathlist)
+ {
+ Path *subpath = (Path *) lfirst(lc);
+
+ /* Shouldn't have any parameterized paths anymore */
+ Assert(subpath->param_info == NULL);
+
+ if (tlist_same_exprs)
+ subpath->pathtarget->sortgrouprefs =
+ scanjoin_target->sortgrouprefs;
+ else
+ {
+ Path *newpath;
+
+ newpath = (Path *) create_projection_path(root, rel, subpath,
+ scanjoin_target);
+ lfirst(lc) = newpath;
+ }
+ }
+
/*
* Now, if final scan/join target contains SRFs, insert ProjectSetPath(s)
* atop each existing path. (Note that this function doesn't look at the
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5c32c96b71..abb77d867e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -416,10 +416,10 @@ set_cheapest(RelOptInfo *parent_rel)
* 'parent_rel' is the relation entry to which the path corresponds.
* 'new_path' is a potential path for parent_rel.
*
- * Returns nothing, but modifies parent_rel->pathlist.
+ * Returns modified pathlist.
*/
-void
-add_path(RelOptInfo *parent_rel, Path *new_path)
+static List *
+add_path_to(RelOptInfo *parent_rel, List *pathlist, Path *new_path)
{
bool accept_new = true; /* unless we find a superior old path */
int insert_at = 0; /* where to insert new item */
@@ -440,7 +440,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
* for more than one old path to be tossed out because new_path dominates
* it.
*/
- foreach(p1, parent_rel->pathlist)
+ foreach(p1, pathlist)
{
Path *old_path = (Path *) lfirst(p1);
bool remove_old = false; /* unless new proves superior */
@@ -584,8 +584,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
*/
if (remove_old)
{
- parent_rel->pathlist = foreach_delete_current(parent_rel->pathlist,
- p1);
+ pathlist = foreach_delete_current(pathlist, p1);
/*
* Delete the data pointed-to by the deleted cell, if possible
@@ -612,8 +611,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
if (accept_new)
{
/* Accept the new path: insert it at proper place in pathlist */
- parent_rel->pathlist =
- list_insert_nth(parent_rel->pathlist, insert_at, new_path);
+ pathlist = list_insert_nth(pathlist, insert_at, new_path);
}
else
{
@@ -621,6 +619,23 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
if (!IsA(new_path, IndexPath))
pfree(new_path);
}
+
+ return pathlist;
+}
+
+void
+add_path(RelOptInfo *parent_rel, Path *new_path)
+{
+ parent_rel->pathlist = add_path_to(parent_rel,
+ parent_rel->pathlist, new_path);
+}
+
+void
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)
+{
+ parent_rel->unique_pathlist = add_path_to(parent_rel,
+ parent_rel->unique_pathlist,
+ new_path);
}
/*
@@ -2662,6 +2677,7 @@ create_projection_path(PlannerInfo *root,
pathnode->path.pathkeys = subpath->pathkeys;
pathnode->subpath = subpath;
+ pathnode->path.uniquekeys = subpath->uniquekeys;
/*
* We might not need a separate Result node. If the input plan node type
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 57e80a47d2..1de5095e74 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -293,6 +293,7 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* unique keys required for the query */
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -695,6 +696,7 @@ typedef struct RelOptInfo
List *pathlist; /* Path structures */
List *ppilist; /* ParamPathInfos used in pathlist */
List *partial_pathlist; /* partial Paths */
+ List *unique_pathlist; /* unique Paths */
struct Path *cheapest_startup_path;
struct Path *cheapest_total_path;
struct Path *cheapest_unique_path;
@@ -886,6 +888,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
bool amcanmarkpos; /* does AM support mark/restore? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
@@ -1220,6 +1223,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *uniquekeys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 620eeda2d6..bb6d730e93 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -27,6 +27,7 @@ extern int compare_fractional_path_costs(Path *path1, Path *path2,
double fraction);
extern void set_cheapest(RelOptInfo *parent_rel);
extern void add_path(RelOptInfo *parent_rel, Path *new_path);
+extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path);
extern bool add_path_precheck(RelOptInfo *parent_rel,
Cost startup_cost, Cost total_cost,
List *pathkeys, Relids required_outer);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 28db3c59cd..16bb5e0eea 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -229,6 +229,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -296,6 +299,11 @@ extern bool relation_has_uniquekeys_for(PlannerInfo *root,
RelOptInfo *rel,
List *exprs,
bool allow_multinulls);
+extern bool query_has_uniquekeys_for(PlannerInfo *root,
+ List *exprs,
+ bool allow_multinulls);
extern bool relation_is_onerow(RelOptInfo *rel);
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+
#endif /* PATHS_H */
--
2.33.1
Attachment: v3-0002-Index-skip-scan.patch (application/octet-stream)
From 8b52789b27fc5730b5715a91fbfe96eba7562be1 Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Fri, 15 Nov 2019 09:46:53 -0500
Subject: [PATCH 4/5] Index skip scan
Implementation of Index Skip Scan (see Loose Index Scan in the wiki [1])
as part of the IndexOnlyScan, IndexScan and BitmapIndexScan for nbtree.
This patch improves performance of two main types of queries significantly:
- SELECT DISTINCT, SELECT DISTINCT ON
- Regular SELECTs with WHERE-clauses on non-leading index attributes
For example, given an nbtree index on three columns (a,b,c), the following queries
may now be significantly faster:
- SELECT DISTINCT ON (a) * FROM t1
- SELECT * FROM t1 WHERE b=2
- SELECT * FROM t1 WHERE b IN (10,40)
- SELECT DISTINCT ON (a,b) * FROM t1 WHERE c BETWEEN 10 AND 100 ORDER BY a,b,c
The original patch and design were proposed by Thomas Munro [2], then revived
and improved by Dmitry Dolgov and Jesper Pedersen. Floris van Nee further
enhanced it with a more general and more performant skip implementation.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
Author: Floris van Nee, Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
contrib/amcheck/verify_nbtree.c | 4 +-
contrib/bloom/blutils.c | 3 +
doc/src/sgml/config.sgml | 15 +
doc/src/sgml/indexam.sgml | 121 +-
doc/src/sgml/indices.sgml | 28 +
src/backend/access/brin/brin.c | 3 +
src/backend/access/gin/ginutil.c | 3 +
src/backend/access/gist/gist.c | 3 +
src/backend/access/hash/hash.c | 3 +
src/backend/access/index/indexam.c | 163 ++
src/backend/access/nbtree/Makefile | 1 +
src/backend/access/nbtree/nbtinsert.c | 2 +-
src/backend/access/nbtree/nbtpage.c | 2 +-
src/backend/access/nbtree/nbtree.c | 58 +-
src/backend/access/nbtree/nbtsearch.c | 790 ++++-----
src/backend/access/nbtree/nbtskip.c | 1455 +++++++++++++++++
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/nbtree/nbtutils.c | 850 +++++++++-
src/backend/access/spgist/spgutils.c | 3 +
src/backend/commands/explain.c | 29 +
src/backend/executor/execScan.c | 37 +-
src/backend/executor/nodeBitmapIndexscan.c | 22 +-
src/backend/executor/nodeIndexonlyscan.c | 69 +-
src/backend/executor/nodeIndexscan.c | 72 +-
src/backend/nodes/copyfuncs.c | 5 +
src/backend/nodes/outfuncs.c | 6 +
src/backend/nodes/readfuncs.c | 5 +
src/backend/optimizer/path/allpaths.c | 52 +-
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/indxpath.c | 68 +
src/backend/optimizer/path/pathkeys.c | 72 +
src/backend/optimizer/plan/createplan.c | 38 +-
src/backend/optimizer/plan/planner.c | 16 +-
src/backend/optimizer/util/pathnode.c | 78 +
src/backend/optimizer/util/plancat.c | 3 +
src/backend/utils/misc/guc.c | 9 +
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/backend/utils/sort/tuplesort.c | 4 +-
src/include/access/amapi.h | 19 +
src/include/access/genam.h | 16 +
src/include/access/nbtree.h | 143 +-
src/include/executor/executor.h | 4 +
src/include/nodes/execnodes.h | 7 +
src/include/nodes/pathnodes.h | 6 +
src/include/nodes/plannodes.h | 5 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 4 +
src/include/optimizer/paths.h | 4 +
src/interfaces/libpq/encnames.c | 1 +
src/interfaces/libpq/wchar.c | 1 +
src/test/regress/expected/select_distinct.out | 599 +++++++
src/test/regress/expected/sysviews.out | 3 +-
src/test/regress/sql/select_distinct.sql | 248 +++
53 files changed, 4611 insertions(+), 546 deletions(-)
create mode 100644 src/backend/access/nbtree/nbtskip.c
create mode 120000 src/interfaces/libpq/encnames.c
create mode 120000 src/interfaces/libpq/wchar.c
diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index d2510ee648..9914ab1f7a 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -2653,7 +2653,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
Buffer lbuf;
bool exists;
- key = _bt_mkscankey(state->rel, itup);
+ key = _bt_mkscankey(state->rel, itup, NULL);
Assert(key->heapkeyspace && key->scantid != NULL);
/*
@@ -3109,7 +3109,7 @@ bt_mkscankey_pivotsearch(Relation rel, IndexTuple itup)
{
BTScanInsert skey;
- skey = _bt_mkscankey(rel, itup);
+ skey = _bt_mkscankey(rel, itup, NULL);
skey->pivotsearch = true;
return skey;
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a434cf93ef..de1bc45800 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -134,6 +134,9 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index afbb6c35e3..8a669b644e 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5012,6 +5012,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
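
For what it's worth, the new GUC behaves like the other planner enable_* switches,
so a quick (hedged) way to compare plans while reviewing is simply to toggle it;
the GUC name comes from this patch, the query is invented:

SET enable_indexskipscan = off;   -- planner falls back to non-skipping plans
EXPLAIN SELECT DISTINCT a FROM t1;
RESET enable_indexskipscan;
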
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 84de931071..e450b68eeb 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -153,6 +153,9 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
+ ambeginscan_skip_function ambeginskipscan; /* can be NULL */
+ amgettuple_with_skip_function amgetskiptuple; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -762,6 +765,122 @@ amrestrpos (IndexScanDesc scan);
struct may be set to NULL.
</para>
+ <para>
+<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefixDir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the prefix part of the tuple is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>postfixDir</parameter></term>
+ <listitem>
+ <para>
+ The direction in which the postfix (everything after the prefix) of the tuple is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+ </variablelist>
+
+ </para>
+ <para>
+<programlisting>
+IndexScanDesc
+ambeginscan_skip (Relation indexRelation,
+ int nkeys,
+ int norderbys,
+ int prefix);
+</programlisting>
+ Prepare for an index scan. The <literal>nkeys</literal> and <literal>norderbys</literal>
+ parameters indicate the number of quals and ordering operators that will be
+ used in the scan; these may be useful for space allocation purposes.
+ Note that the actual values of the scan keys aren't provided yet.
+ The result must be a palloc'd struct.
+ For implementation reasons the index access method
+ <emphasis>must</emphasis> create this struct by calling
+ <function>RelationGetIndexScan()</function>. In most cases
+ <function>ambeginscan</function> does little beyond making that call and perhaps
+ acquiring locks;
+ the interesting parts of index-scan startup are in <function>amrescan</function>.
+ If this is a skip scan, <literal>prefix</literal> must indicate the length of the prefix that
+ can be skipped over. <literal>prefix</literal> can be set to -1 to disable skipping, which
+ yields a scan identical to one started with a regular call to <function>ambeginscan</function>.
+ </para>
+ <para>
+ <programlisting>
+ boolean
+ amgettuple_skip (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+ </programlisting>
+ Fetch the next tuple in the given scan, moving in the given
+ directions. The directions are given separately for the prefix and for the rest of
+ the key; the size of the prefix was specified in the <function>btbeginscan_skip</function>
+ call. The two directions can differ in DISTINCT scans when fetching backwards
+ from a cursor.
+ Returns true if a tuple was
+ obtained, false if no matching tuples remain. In the true case the tuple
+ TID is stored into the <literal>scan</literal> structure. Note that
+ <quote>success</quote> means only that the index contains an entry that matches
+ the scan keys, not that the tuple necessarily still exists in the heap or
+ will pass the caller's snapshot test. On success, <function>amgettuple</function>
+ must also set <literal>scan->xs_recheck</literal> to true or false.
+ False means it is certain that the index entry matches the scan keys.
+ true means this is not certain, and the conditions represented by the
+ scan keys must be rechecked against the heap tuple after fetching it.
+ This provision supports <quote>lossy</quote> index operators.
+ Note that rechecking will extend only to the scan conditions; a partial
+ index predicate (if any) is never rechecked by <function>amgettuple</function>
+ callers.
+ </para>
+
+ <para>
+ If the index supports <link linkend="indexes-index-only-scans">index-only
+ scans</link> (i.e., <function>amcanreturn</function> returns true for it),
+ then on success the AM must also check <literal>scan->xs_want_itup</literal>,
+ and if that is true it must return the originally indexed data for the
+ index entry. The data can be returned in the form of an
+ <structname>IndexTuple</structname> pointer stored at <literal>scan->xs_itup</literal>,
+ with tuple descriptor <literal>scan->xs_itupdesc</literal>; or in the form of
+ a <structname>HeapTuple</structname> pointer stored at <literal>scan->xs_hitup</literal>,
+ with tuple descriptor <literal>scan->xs_hitupdesc</literal>. (The latter
+ format should be used when reconstructing data that might possibly not fit
+ into an <structname>IndexTuple</structname>.) In either case,
+ management of the data referenced by the pointer is the access method's
+ responsibility. The data must remain good at least until the next
+ <function>amgettuple</function>, <function>amrescan</function>, or <function>amendscan</function>
+ call for the scan.
+ </para>
+
+ <para>
+ The <function>amgettuple</function> function need only be provided if the access
+ method supports <quote>plain</quote> index scans. If it doesn't, the
+ <structfield>amgettuple</structfield> field in its <structname>IndexAmRoutine</structname>
+ struct must be set to NULL.
+ </para>
+
<para>
In addition to supporting ordinary index scans, some types of index
may wish to support <firstterm>parallel index scans</firstterm>, which allow
@@ -777,7 +896,7 @@ amrestrpos (IndexScanDesc scan);
functions may be implemented to support parallel index scans:
</para>
- <para>
+ <para>
<programlisting>
Size
amestimateparallelscan (void);
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 023157d888..0fd48c6a6f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1297,6 +1297,34 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When the rows retrieved from an index scan are then deduplicated by
+ eliminating rows matching on a prefix of index keys (e.g. when using
+ <literal>SELECT DISTINCT</literal>), the planner will consider
+ skipping groups of rows with a matching key prefix. When a row with
+ a particular prefix is found, remaining rows with the same key prefix
+ are skipped. The larger the number of rows with the same key prefix
+ rows (i.e. the lower the number of distinct key prefixes in the index),
+ the more efficient this is.
+ </para>
+ <para>
+ Additionally, a skip scan can be considered in regular <literal>SELECT</literal>
+ queries. When filtering on a non-leading attribute of an index, the planner
+ may choose a skip scan.
+ </para>
+ </sect2>
</sect1>
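
The two query shapes described by the new documentation section, as a hedged
sketch (schema invented for illustration, not taken from the patch):

CREATE TABLE sales (region int, item int, amount numeric);
CREATE INDEX ON sales (region, item);

-- Deduplication on a key prefix: the scan can jump from one region
-- directly to the next instead of reading every row.
SELECT DISTINCT region FROM sales;

-- Filter on a non-leading index attribute: within each region the scan
-- probes for item = 42 and then skips to the next region.
SELECT * FROM sales WHERE item = 42;
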
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ba78ecff66..43f1f46de9 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -119,6 +119,9 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3d15701a01..d1bac43d7f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -67,6 +67,9 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c3cdfca9a2..cd9a45bcf6 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -88,6 +88,9 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index d48c8a4549..55d5ca5804 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -85,6 +85,9 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index fe80b8b0ba..c060e01b44 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -14,7 +14,9 @@
* index_open - open an index relation by relation OID
* index_close - close an index relation
* index_beginscan - start a scan of an index with amgettuple
+ * index_beginscan_skip - start a scan of an index with amgettuple and skipping
* index_beginscan_bitmap - start a scan of an index with amgetbitmap
+ * index_beginscan_bitmap_skip - start a skip scan of an index with amgetbitmap
* index_rescan - restart a scan of an index
* index_endscan - end a scan
* index_insert - insert an index tuple into a relation
@@ -25,14 +27,17 @@
* index_parallelrescan - (re)start a parallel scan of an index
* index_beginscan_parallel - join parallel index scan
* index_getnext_tid - get the next TID from a scan
+ * index_getnext_tid_skip - get the next TID from a skip scan
* index_fetch_heap - get the scan's next heap tuple
* index_getnext_slot - get the next tuple from a scan
+ * index_getnext_slot_skip - get the next tuple from a skip scan
* index_getbitmap - get all tuples from a scan
* index_bulk_delete - bulk deletion of index tuples
* index_vacuum_cleanup - post-deletion cleanup of an index
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -224,6 +229,78 @@ index_beginscan(Relation heapRelation,
return scan;
}
+static IndexScanDesc
+index_beginscan_internal_skip(Relation indexRelation,
+ int nkeys, int norderbys, int prefix, Snapshot snapshot,
+ ParallelIndexScanDesc pscan, bool temp_snap)
+{
+ IndexScanDesc scan;
+
+ RELATION_CHECKS;
+ CHECK_REL_PROCEDURE(ambeginskipscan);
+
+ if (!(indexRelation->rd_indam->ampredlocks))
+ PredicateLockRelation(indexRelation, snapshot);
+
+ /*
+ * We hold a reference count to the relcache entry throughout the scan.
+ */
+ RelationIncrementReferenceCount(indexRelation);
+
+ /*
+ * Tell the AM to open a scan.
+ */
+ scan = indexRelation->rd_indam->ambeginskipscan(indexRelation, nkeys,
+ norderbys, prefix);
+ /* Initialize information for parallel scan. */
+ scan->parallel_scan = pscan;
+ scan->xs_temp_snap = temp_snap;
+
+ return scan;
+}
+
+IndexScanDesc
+index_beginscan_skip(Relation heapRelation,
+ Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys, int norderbys, int prefix)
+{
+ IndexScanDesc scan;
+
+ scan = index_beginscan_internal_skip(indexRelation, nkeys, norderbys, prefix, snapshot, NULL, false);
+
+ /*
+ * Save additional parameters into the scandesc. Everything else was set
+ * up by RelationGetIndexScan.
+ */
+ scan->heapRelation = heapRelation;
+ scan->xs_snapshot = snapshot;
+
+ /* prepare to fetch index matches from table */
+ scan->xs_heapfetch = table_index_fetch_begin(heapRelation);
+
+ return scan;
+}
+
+IndexScanDesc
+index_beginscan_bitmap_skip(Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys,
+ int prefix)
+{
+ IndexScanDesc scan;
+
+ scan = index_beginscan_internal_skip(indexRelation, nkeys, 0, prefix, snapshot, NULL, false);
+
+ /*
+ * Save additional parameters into the scandesc. Everything else was set
+ * up by RelationGetIndexScan.
+ */
+ scan->xs_snapshot = snapshot;
+
+ return scan;
+}
+
/*
* index_beginscan_bitmap - start a scan of an index with amgetbitmap
*
@@ -553,6 +630,45 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
return &scan->xs_heaptid;
}
+ItemPointer
+index_getnext_tid_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ bool found;
+
+ SCAN_CHECKS;
+ CHECK_SCAN_PROCEDURE(amgetskiptuple);
+
+ Assert(TransactionIdIsValid(RecentXmin));
+
+ /*
+ * The AM's amgetskiptuple proc finds the next index entry matching the scan
+ * keys, and puts the TID into scan->xs_heaptid. It should also set
+ * scan->xs_recheck and possibly scan->xs_itup/scan->xs_hitup, though we
+ * pay no attention to those fields here.
+ */
+ found = scan->indexRelation->rd_indam->amgetskiptuple(scan, prefixDir, postfixDir);
+
+ /* Reset kill flag immediately for safety */
+ scan->kill_prior_tuple = false;
+ scan->xs_heap_continue = false;
+
+ /* If we're out of index entries, we're done */
+ if (!found)
+ {
+ /* release resources (like buffer pins) from table accesses */
+ if (scan->xs_heapfetch)
+ table_index_fetch_reset(scan->xs_heapfetch);
+
+ return NULL;
+ }
+ Assert(ItemPointerIsValid(&scan->xs_heaptid));
+
+ pgstat_count_index_tuples(scan->indexRelation, 1);
+
+ /* Return the TID of the tuple we found. */
+ return &scan->xs_heaptid;
+}
+
/* ----------------
* index_fetch_heap - get the scan's next heap tuple
*
@@ -644,6 +760,38 @@ index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *
return false;
}
+bool
+index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir, TupleTableSlot *slot)
+{
+ for (;;)
+ {
+ if (!scan->xs_heap_continue)
+ {
+ ItemPointer tid;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid_skip(scan, prefixDir, postfixDir);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ break;
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+ }
+
+ /*
+ * Fetch the next (or only) visible heap tuple for this index entry.
+ * If we don't find anything, loop around and grab the next TID from
+ * the index.
+ */
+ Assert(ItemPointerIsValid(&scan->xs_heaptid));
+ if (index_fetch_heap(scan, slot))
+ return true;
+ }
+
+ return false;
+}
+
/* ----------------
* index_getbitmap - get all tuples at once from an index scan
*
@@ -739,6 +887,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, prefixDir, postfixDir);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/nbtree/Makefile b/src/backend/access/nbtree/Makefile
index d69808e78c..da96ac00a6 100644
--- a/src/backend/access/nbtree/Makefile
+++ b/src/backend/access/nbtree/Makefile
@@ -19,6 +19,7 @@ OBJS = \
nbtpage.o \
nbtree.o \
nbtsearch.o \
+ nbtskip.o \
nbtsort.o \
nbtsplitloc.o \
nbtutils.o \
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 62746c4721..6af52431ab 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -108,7 +108,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
bool checkingunique = (checkUnique != UNIQUE_CHECK_NO);
/* we need an insertion scan key to do our search, so build one */
- itup_key = _bt_mkscankey(rel, itup);
+ itup_key = _bt_mkscankey(rel, itup, NULL);
if (checkingunique)
{
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 6b5f01e1d0..284589aa28 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1967,7 +1967,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
}
/* we need an insertion scan key for the search, so build one */
- itup_key = _bt_mkscankey(rel, targetkey);
+ itup_key = _bt_mkscankey(rel, targetkey, NULL);
/* find the leftmost leaf page with matching pivot/high key */
itup_key->pivotsearch = true;
stack = _bt_search(rel, itup_key, &sleafbuf, BT_READ, NULL);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 13024af2fa..5c96efa747 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -124,6 +124,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -131,8 +132,10 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->amvalidate = btvalidate;
amroutine->amadjustmembers = btadjustmembers;
amroutine->ambeginscan = btbeginscan;
+ amroutine->ambeginskipscan = btbeginscan_skip;
amroutine->amrescan = btrescan;
amroutine->amgettuple = btgettuple;
+ amroutine->amgetskiptuple = btgettuple_skip;
amroutine->amgetbitmap = btgetbitmap;
amroutine->amendscan = btendscan;
amroutine->ammarkpos = btmarkpos;
@@ -209,6 +212,15 @@ btinsert(Relation rel, Datum *values, bool *isnull,
*/
bool
btgettuple(IndexScanDesc scan, ScanDirection dir)
+{
+ return btgettuple_skip(scan, dir, dir);
+}
+
+/*
+ * btgettuple_skip() -- Get the next tuple in the scan.
+ */
+bool
+btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
@@ -227,7 +239,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
if (so->numArrayKeys < 0)
return false;
- _bt_start_array_keys(scan, dir);
+ _bt_start_array_keys(scan, prefixDir);
}
/* This loop handles advancing to the next array elements, if any */
@@ -239,7 +251,7 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
* _bt_first() to get the first item in the scan.
*/
if (!BTScanPosIsValid(so->currPos))
- res = _bt_first(scan, dir);
+ res = _bt_first(scan, prefixDir, postfixDir);
else
{
/*
@@ -266,14 +278,14 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
/*
* Now continue the scan.
*/
- res = _bt_next(scan, dir);
+ res = _bt_next(scan, prefixDir, postfixDir);
}
/* If we have a tuple, return it ... */
if (res)
break;
/* ... otherwise see if we have more array keys to deal with */
- } while (so->numArrayKeys && _bt_advance_array_keys(scan, dir));
+ } while (so->numArrayKeys && _bt_advance_array_keys(scan, prefixDir));
return res;
}
@@ -304,7 +316,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
do
{
/* Fetch the first page & tuple */
- if (_bt_first(scan, ForwardScanDirection))
+ if (_bt_first(scan, ForwardScanDirection, ForwardScanDirection))
{
/* Save tuple ID, and continue scanning */
heapTid = &scan->xs_heaptid;
@@ -320,7 +332,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
if (++so->currPos.itemIndex > so->currPos.lastItem)
{
/* let _bt_next do the heavy lifting */
- if (!_bt_next(scan, ForwardScanDirection))
+ if (!_bt_next(scan, ForwardScanDirection, ForwardScanDirection))
break;
}
@@ -341,6 +353,16 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
*/
IndexScanDesc
btbeginscan(Relation rel, int nkeys, int norderbys)
+{
+ return btbeginscan_skip(rel, nkeys, norderbys, -1);
+}
+
+
+/*
+ * btbeginscan_skip() -- start a (possibly skipping) scan on a btree index
+ */
+IndexScanDesc
+btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix)
{
IndexScanDesc scan;
BTScanOpaque so;
@@ -375,10 +397,20 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipData = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
+ if (skipPrefix > 0)
+ {
+ so->skipData = (BTSkip) palloc0(sizeof(BTSkipData));
+ so->skipData->prefix = skipPrefix;
+
+ elog(DEBUG1, "skip prefix: %d", skipPrefix);
+ }
+
return scan;
}
@@ -441,6 +473,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ return _bt_skip(scan, prefixDir, postfixDir);
+}
+
/*
* btendscan() -- close down a scan
*/
@@ -474,6 +515,8 @@ btendscan(IndexScanDesc scan)
if (so->currTuples != NULL)
pfree(so->currTuples);
/* so->markTuples should not be pfree'd, see btrescan */
+ if (_bt_skip_enabled(so))
+ pfree(so->skipData);
pfree(so);
}
@@ -557,6 +600,9 @@ btrestrpos(IndexScanDesc scan)
if (so->currTuples)
memcpy(so->currTuples, so->markTuples,
so->markPos.nextTupleOffset);
+ if (so->skipData)
+ memcpy(&so->skipData->curPos, &so->skipData->markPos,
+ sizeof(BTSkipPosData));
}
else
BTScanPosInvalidate(so->currPos);
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 9d82d4904d..4f56fafd76 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -17,19 +17,17 @@
#include "access/nbtree.h"
#include "access/relscan.h"
+#include "catalog/catalog.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/predicate.h"
+#include "utils/guc.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
-static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
static int _bt_binsrch_posting(BTScanInsert key, Page page,
OffsetNumber offnum);
-static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
- OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
@@ -38,14 +36,12 @@ static int _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum,
ItemPointer heapTid, int tupleOffset);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
-static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
ScanDirection dir);
-static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
-static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline bool _bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool isRegularMode,
+ bool *continuescan, int *prefixskipindex);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -56,7 +52,7 @@ static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
*
* See nbtree/README section on making concurrent TID recycling safe.
*/
-static void
+void
_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
{
_bt_unlockbuf(scan->indexRelation, sp->buf);
@@ -334,7 +330,7 @@ _bt_moveright(Relation rel,
* the given page. _bt_binsrch() has no lock or refcount side effects
* on the buffer.
*/
-static OffsetNumber
+OffsetNumber
_bt_binsrch(Relation rel,
BTScanInsert key,
Buffer buf)
@@ -857,25 +853,23 @@ _bt_compare(Relation rel,
* in locating the scan start position.
*/
bool
-_bt_first(IndexScanDesc scan, ScanDirection dir)
+_bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
{
Relation rel = scan->indexRelation;
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Buffer buf;
BTStack stack;
OffsetNumber offnum;
- StrategyNumber strat;
- bool nextkey;
bool goback;
BTScanInsertData inskey;
ScanKey startKeys[INDEX_MAX_KEYS];
ScanKeyData notnullkeys[INDEX_MAX_KEYS];
int keysCount = 0;
- int i;
bool status;
StrategyNumber strat_total;
BTScanPosItem *currItem;
BlockNumber blkno;
+ IndexTuple itup;
Assert(!BTScanPosIsValid(so->currPos));
@@ -916,184 +910,13 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
}
else if (blkno != InvalidBlockNumber)
{
- if (!_bt_parallel_readpage(scan, blkno, dir))
+ if (!_bt_parallel_readpage(scan, blkno, prefixDir))
return false;
goto readcomplete;
}
}
- /*----------
- * Examine the scan keys to discover where we need to start the scan.
- *
- * We want to identify the keys that can be used as starting boundaries;
- * these are =, >, or >= keys for a forward scan or =, <, <= keys for
- * a backwards scan. We can use keys for multiple attributes so long as
- * the prior attributes had only =, >= (resp. =, <=) keys. Once we accept
- * a > or < boundary or find an attribute with no boundary (which can be
- * thought of as the same as "> -infinity"), we can't use keys for any
- * attributes to its right, because it would break our simplistic notion
- * of what initial positioning strategy to use.
- *
- * When the scan keys include cross-type operators, _bt_preprocess_keys
- * may not be able to eliminate redundant keys; in such cases we will
- * arbitrarily pick a usable one for each attribute. This is correct
- * but possibly not optimal behavior. (For example, with keys like
- * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
- * x=5 would be more efficient.) Since the situation only arises given
- * a poorly-worded query plus an incomplete opfamily, live with it.
- *
- * When both equality and inequality keys appear for a single attribute
- * (again, only possible when cross-type operators appear), we *must*
- * select one of the equality keys for the starting point, because
- * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
- * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
- * start at x=4, we will fail and stop before reaching x=10. If multiple
- * equality quals survive preprocessing, however, it doesn't matter which
- * one we use --- by definition, they are either redundant or
- * contradictory.
- *
- * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
- * If the index stores nulls at the end of the index we'll be starting
- * from, and we have no boundary key for the column (which means the key
- * we deduced NOT NULL from is an inequality key that constrains the other
- * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
- * use as a boundary key. If we didn't do this, we might find ourselves
- * traversing a lot of null entries at the start of the scan.
- *
- * In this loop, row-comparison keys are treated the same as keys on their
- * first (leftmost) columns. We'll add on lower-order columns of the row
- * comparison below, if possible.
- *
- * The selected scan keys (at most one per index column) are remembered by
- * storing their addresses into the local startKeys[] array.
- *----------
- */
- strat_total = BTEqualStrategyNumber;
- if (so->numberOfKeys > 0)
- {
- AttrNumber curattr;
- ScanKey chosen;
- ScanKey impliesNN;
- ScanKey cur;
-
- /*
- * chosen is the so-far-chosen key for the current attribute, if any.
- * We don't cast the decision in stone until we reach keys for the
- * next attribute.
- */
- curattr = 1;
- chosen = NULL;
- /* Also remember any scankey that implies a NOT NULL constraint */
- impliesNN = NULL;
-
- /*
- * Loop iterates from 0 to numberOfKeys inclusive; we use the last
- * pass to handle after-last-key processing. Actual exit from the
- * loop is at one of the "break" statements below.
- */
- for (cur = so->keyData, i = 0;; cur++, i++)
- {
- if (i >= so->numberOfKeys || cur->sk_attno != curattr)
- {
- /*
- * Done looking at keys for curattr. If we didn't find a
- * usable boundary key, see if we can deduce a NOT NULL key.
- */
- if (chosen == NULL && impliesNN != NULL &&
- ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
- ScanDirectionIsForward(dir) :
- ScanDirectionIsBackward(dir)))
- {
- /* Yes, so build the key in notnullkeys[keysCount] */
- chosen = ¬nullkeys[keysCount];
- ScanKeyEntryInitialize(chosen,
- (SK_SEARCHNOTNULL | SK_ISNULL |
- (impliesNN->sk_flags &
- (SK_BT_DESC | SK_BT_NULLS_FIRST))),
- curattr,
- ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
- BTGreaterStrategyNumber :
- BTLessStrategyNumber),
- InvalidOid,
- InvalidOid,
- InvalidOid,
- (Datum) 0);
- }
-
- /*
- * If we still didn't find a usable boundary key, quit; else
- * save the boundary key pointer in startKeys.
- */
- if (chosen == NULL)
- break;
- startKeys[keysCount++] = chosen;
-
- /*
- * Adjust strat_total, and quit if we have stored a > or <
- * key.
- */
- strat = chosen->sk_strategy;
- if (strat != BTEqualStrategyNumber)
- {
- strat_total = strat;
- if (strat == BTGreaterStrategyNumber ||
- strat == BTLessStrategyNumber)
- break;
- }
-
- /*
- * Done if that was the last attribute, or if next key is not
- * in sequence (implying no boundary key is available for the
- * next attribute).
- */
- if (i >= so->numberOfKeys ||
- cur->sk_attno != curattr + 1)
- break;
-
- /*
- * Reset for next attr.
- */
- curattr = cur->sk_attno;
- chosen = NULL;
- impliesNN = NULL;
- }
-
- /*
- * Can we use this key as a starting boundary for this attr?
- *
- * If not, does it imply a NOT NULL constraint? (Because
- * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
- * *any* inequality key works for that; we need not test.)
- */
- switch (cur->sk_strategy)
- {
- case BTLessStrategyNumber:
- case BTLessEqualStrategyNumber:
- if (chosen == NULL)
- {
- if (ScanDirectionIsBackward(dir))
- chosen = cur;
- else
- impliesNN = cur;
- }
- break;
- case BTEqualStrategyNumber:
- /* override any non-equality choice */
- chosen = cur;
- break;
- case BTGreaterEqualStrategyNumber:
- case BTGreaterStrategyNumber:
- if (chosen == NULL)
- {
- if (ScanDirectionIsForward(dir))
- chosen = cur;
- else
- impliesNN = cur;
- }
- break;
- }
- }
- }
+ keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, prefixDir, startKeys, notnullkeys, &strat_total, 0);
/*
* If we found no usable boundary keys, we have to start from one end of
@@ -1104,260 +927,112 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
{
bool match;
- match = _bt_endpoint(scan, dir);
-
- if (!match)
+ if (!_bt_skip_enabled(so))
{
- /* No match, so mark (parallel) scan finished */
- _bt_parallel_done(scan);
- }
+ match = _bt_endpoint(scan, prefixDir);
- return match;
- }
+ if (!match)
+ {
+ /* No match, so mark (parallel) scan finished */
+ _bt_parallel_done(scan);
+ }
- /*
- * We want to start the scan somewhere within the index. Set up an
- * insertion scankey we can use to search for the boundary point we
- * identified above. The insertion scankey is built using the keys
- * identified by startKeys[]. (Remaining insertion scankey fields are
- * initialized after initial-positioning strategy is finalized.)
- */
- Assert(keysCount <= INDEX_MAX_KEYS);
- for (i = 0; i < keysCount; i++)
- {
- ScanKey cur = startKeys[i];
+ return match;
+ }
+ else
+ {
+ Relation rel = scan->indexRelation;
+ Buffer buf;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber start;
+ BTSkipCompareResult cmp = {0};
- Assert(cur->sk_attno == i + 1);
+ _bt_skip_create_scankeys(rel, so);
- if (cur->sk_flags & SK_ROW_HEADER)
- {
/*
- * Row comparison header: look to the first row member instead.
- *
- * The member scankeys are already in insertion format (ie, they
- * have sk_func = 3-way-comparison function), but we have to watch
- * out for nulls, which _bt_preprocess_keys didn't check. A null
- * in the first row member makes the condition unmatchable, just
- * like qual_ok = false.
+ * Scan down to the leftmost or rightmost leaf page and position
+ * the scan on the leftmost or rightmost item on that page.
+ * Start the skip scan from there to find the first matching item
*/
- ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+ buf = _bt_get_endpoint(rel, 0, ScanDirectionIsBackward(prefixDir), scan->xs_snapshot);
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_flags & SK_ISNULL)
+ if (!BufferIsValid(buf))
{
- _bt_parallel_done(scan);
+ /*
+ * Empty index. Lock the whole relation, as nothing finer to lock
+ * exists.
+ */
+ PredicateLockRelation(rel, scan->xs_snapshot);
+ BTScanPosInvalidate(so->currPos);
return false;
}
- memcpy(inskey.scankeys + i, subkey, sizeof(ScanKeyData));
- /*
- * If the row comparison is the last positioning key we accepted,
- * try to add additional keys from the lower-order row members.
- * (If we accepted independent conditions on additional index
- * columns, we use those instead --- doesn't seem worth trying to
- * determine which is more restrictive.) Note that this is OK
- * even if the row comparison is of ">" or "<" type, because the
- * condition applied to all but the last row member is effectively
- * ">=" or "<=", and so the extra keys don't break the positioning
- * scheme. But, by the same token, if we aren't able to use all
- * the row members, then the part of the row comparison that we
- * did use has to be treated as just a ">=" or "<=" condition, and
- * so we'd better adjust strat_total accordingly.
- */
- if (i == keysCount - 1)
+ PredicateLockPage(rel, BufferGetBlockNumber(buf), scan->xs_snapshot);
+ page = BufferGetPage(buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ Assert(P_ISLEAF(opaque));
+
+ if (ScanDirectionIsForward(prefixDir))
{
- bool used_all_subkeys = false;
+ /* There could be dead pages to the left, so not this: */
+ /* Assert(P_LEFTMOST(opaque)); */
- Assert(!(subkey->sk_flags & SK_ROW_END));
- for (;;)
- {
- subkey++;
- Assert(subkey->sk_flags & SK_ROW_MEMBER);
- if (subkey->sk_attno != keysCount + 1)
- break; /* out-of-sequence, can't use it */
- if (subkey->sk_strategy != cur->sk_strategy)
- break; /* wrong direction, can't use it */
- if (subkey->sk_flags & SK_ISNULL)
- break; /* can't use null keys */
- Assert(keysCount < INDEX_MAX_KEYS);
- memcpy(inskey.scankeys + keysCount, subkey,
- sizeof(ScanKeyData));
- keysCount++;
- if (subkey->sk_flags & SK_ROW_END)
- {
- used_all_subkeys = true;
- break;
- }
- }
- if (!used_all_subkeys)
- {
- switch (strat_total)
- {
- case BTLessStrategyNumber:
- strat_total = BTLessEqualStrategyNumber;
- break;
- case BTGreaterStrategyNumber:
- strat_total = BTGreaterEqualStrategyNumber;
- break;
- }
- }
- break; /* done with outer loop */
+ start = P_FIRSTDATAKEY(opaque);
}
- }
- else
- {
- /*
- * Ordinary comparison key. Transform the search-style scan key
- * to an insertion scan key by replacing the sk_func with the
- * appropriate btree comparison function.
- *
- * If scankey operator is not a cross-type comparison, we can use
- * the cached comparison function; otherwise gotta look it up in
- * the catalogs. (That can't lead to infinite recursion, since no
- * indexscan initiated by syscache lookup will use cross-data-type
- * operators.)
- *
- * We support the convention that sk_subtype == InvalidOid means
- * the opclass input type; this is a hack to simplify life for
- * ScanKeyInit().
- */
- if (cur->sk_subtype == rel->rd_opcintype[i] ||
- cur->sk_subtype == InvalidOid)
+ else if (ScanDirectionIsBackward(prefixDir))
{
- FmgrInfo *procinfo;
-
- procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
- ScanKeyEntryInitializeWithInfo(inskey.scankeys + i,
- cur->sk_flags,
- cur->sk_attno,
- InvalidStrategy,
- cur->sk_subtype,
- cur->sk_collation,
- procinfo,
- cur->sk_argument);
+ Assert(P_RIGHTMOST(opaque));
+
+ start = PageGetMaxOffsetNumber(page);
}
else
{
- RegProcedure cmp_proc;
-
- cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
- rel->rd_opcintype[i],
- cur->sk_subtype,
- BTORDER_PROC);
- if (!RegProcedureIsValid(cmp_proc))
- elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
- BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
- cur->sk_attno, RelationGetRelationName(rel));
- ScanKeyEntryInitialize(inskey.scankeys + i,
- cur->sk_flags,
- cur->sk_attno,
- InvalidStrategy,
- cur->sk_subtype,
- cur->sk_collation,
- cmp_proc,
- cur->sk_argument);
+ elog(ERROR, "invalid scan direction: %d", (int) prefixDir);
}
- }
- }
- /*----------
- * Examine the selected initial-positioning strategy to determine exactly
- * where we need to start the scan, and set flag variables to control the
- * code below.
- *
- * If nextkey = false, _bt_search and _bt_binsrch will locate the first
- * item >= scan key. If nextkey = true, they will locate the first
- * item > scan key.
- *
- * If goback = true, we will then step back one item, while if
- * goback = false, we will start the scan on the located item.
- *----------
- */
- switch (strat_total)
- {
- case BTLessStrategyNumber:
-
- /*
- * Find first item >= scankey, then back up one to arrive at last
- * item < scankey. (Note: this positioning strategy is only used
- * for a backward scan, so that is always the correct starting
- * position.)
- */
- nextkey = false;
- goback = true;
- break;
-
- case BTLessEqualStrategyNumber:
-
- /*
- * Find first item > scankey, then back up one to arrive at last
- * item <= scankey. (Note: this positioning strategy is only used
- * for a backward scan, so that is always the correct starting
- * position.)
- */
- nextkey = true;
- goback = true;
- break;
-
- case BTEqualStrategyNumber:
-
- /*
- * If a backward scan was specified, need to start with last equal
- * item not first one.
+ /* remember which buffer we have pinned */
+ so->currPos.buf = buf;
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ itup = _bt_get_tuple_from_offset(so, start);
+ /* In some cases we can (or have to) skip further inside the prefix.
+ * We can do this if extra quals become available, e.g.
+ * WHERE b=2 on an index on (a,b).
+ * We must do so if this is not regular mode (prefixDir != postfixDir),
+ * because in that case we're at the end of the prefix while we should
+ * be at the beginning.
+ */
- if (ScanDirectionIsBackward(dir))
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, 0) ||
+ !_bt_skip_is_regular_mode(prefixDir, postfixDir))
{
- /*
- * This is the same as the <= strategy. We will check at the
- * end whether the found item is actually =.
- */
- nextkey = true;
- goback = true;
+ _bt_skip_extra_conditions(scan, &itup, &start, prefixDir, postfixDir, &cmp);
}
- else
+ /* now find the next matching tuple */
+ match = _bt_skip_find_next(scan, itup, start, prefixDir, postfixDir);
+ if (!match)
{
- /*
- * This is the same as the >= strategy. We will check at the
- * end whether the found item is actually =.
- */
- nextkey = false;
- goback = false;
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return false;
}
- break;
- case BTGreaterEqualStrategyNumber:
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
- /*
- * Find first item >= scankey. (This is only used for forward
- * scans.)
- */
- nextkey = false;
- goback = false;
- break;
-
- case BTGreaterStrategyNumber:
-
- /*
- * Find first item > scankey. (This is only used for forward
- * scans.)
- */
- nextkey = true;
- goback = false;
- break;
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
- default:
- /* can't get here, but keep compiler quiet */
- elog(ERROR, "unrecognized strat_total: %d", (int) strat_total);
- return false;
+ return true;
+ }
}
- /* Initialize remaining insertion scan key fields */
- _bt_metaversion(rel, &inskey.heapkeyspace, &inskey.allequalimage);
- inskey.anynullkeys = false; /* unused */
- inskey.nextkey = nextkey;
- inskey.pivotsearch = false;
- inskey.scantid = NULL;
- inskey.keysz = keysCount;
+ if (!_bt_create_insertion_scan_key(rel, prefixDir, startKeys, keysCount, &inskey, &strat_total, &goback))
+ {
+ _bt_parallel_done(scan);
+ return false;
+ }
/*
* Use the manufactured insertion scan key to descend the tree and
@@ -1389,7 +1064,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(buf),
scan->xs_snapshot);
- _bt_initialize_more_data(so, dir);
+ _bt_initialize_more_data(so, prefixDir);
/* position to the precise item on the page */
offnum = _bt_binsrch(rel, &inskey, buf);
@@ -1419,23 +1094,81 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
Assert(!BTScanPosIsValid(so->currPos));
so->currPos.buf = buf;
- /*
- * Now load data from the first page of the scan.
- */
- if (!_bt_readpage(scan, dir, offnum))
+ if (_bt_skip_enabled(so))
{
- /*
- * There's no actually-matching data on this page. Try to advance to
- * the next page. Return false if there's no matching data at all.
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ bool match;
+ BTSkipCompareResult cmp = {0};
+
+ /* first create the skip scan keys */
+ _bt_skip_create_scankeys(rel, so);
+
+ /* remember which page we have pinned */
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ minoff = P_FIRSTDATAKEY(opaque);
+ /* _bt_binsrch combined with the goback adjustment can leave offnum before the first item
+ * on the page or after the last item on the page. If that is the case we need to step
+ * back or forward one page.
+ */
- _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
- if (!_bt_steppage(scan, dir))
+ if (offnum < minoff)
+ {
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_back_page(scan, &itup, &offnum))
+ return false;
+ page = BufferGetPage(so->currPos.buf);
+ }
+ else if (offnum > PageGetMaxOffsetNumber(page))
+ {
+ BlockNumber next = opaque->btpo_next;
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_forward_page(scan, next, &itup, &offnum))
+ return false;
+ page = BufferGetPage(so->currPos.buf);
+ }
+
+ itup = _bt_get_tuple_from_offset(so, offnum);
+ /* check if we can skip even more because we can use new conditions */
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, inskey.keysz) ||
+ !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ {
+ _bt_skip_extra_conditions(scan, &itup, &offnum, prefixDir, postfixDir, &cmp);
+ }
+ /* now find the tuple */
+ match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+ if (!match)
+ {
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
return false;
+ }
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
}
else
{
- /* Drop the lock, and maybe the pin, on the current page */
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ /*
+ * Now load data from the first page of the scan.
+ */
+ if (!_bt_readpage(scan, prefixDir, &offnum, true))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (!_bt_steppage(scan, prefixDir))
+ return false;
+ }
+ else
+ {
+ /* Drop the lock, and maybe the pin, on the current page */
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
}
readcomplete:
@@ -1463,29 +1196,113 @@ readcomplete:
* so->currPos.buf to InvalidBuffer.
*/
bool
-_bt_next(IndexScanDesc scan, ScanDirection dir)
+_bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
- /*
- * Advance to next tuple on current page; or if there's no more, try to
- * step to the next page with data.
- */
- if (ScanDirectionIsForward(dir))
+ if (!_bt_skip_enabled(so))
{
- if (++so->currPos.itemIndex > so->currPos.lastItem)
+ /*
+ * Advance to next tuple on current page; or if there's no more, try to
+ * step to the next page with data.
+ */
+ if (ScanDirectionIsForward(prefixDir))
{
- if (!_bt_steppage(scan, dir))
- return false;
+ if (++so->currPos.itemIndex > so->currPos.lastItem)
+ {
+ if (!_bt_steppage(scan, prefixDir))
+ return false;
+ }
+ }
+ else
+ {
+ if (--so->currPos.itemIndex < so->currPos.firstItem)
+ {
+ if (!_bt_steppage(scan, prefixDir))
+ return false;
+ }
}
}
else
{
- if (--so->currPos.itemIndex < so->currPos.firstItem)
+ bool match;
+ IndexTuple itup = NULL;
+ OffsetNumber offnum = InvalidOffsetNumber;
+
+ if (ScanDirectionIsForward(postfixDir))
{
- if (!_bt_steppage(scan, dir))
- return false;
+ if (++so->currPos.itemIndex > so->currPos.lastItem)
+ {
+ if (prefixDir != so->skipData->curPos.nextDirection)
+ {
+ /* This happens when a cursor scan changes direction
+ * in the meantime, e.g. a fetch forwards followed by a
+ * fetch backwards.
+ * We *always* just go to the next page instead of skipping,
+ * because that's the only safe option.
+ */
+ so->skipData->curPos.nextAction = SkipStateNext;
+ so->skipData->curPos.nextDirection = prefixDir;
+ }
+
+ if (so->skipData->curPos.nextAction == SkipStateNext)
+ {
+ /* we should just go forwards one page, no skipping is necessary */
+ if (!_bt_step_forward_page(scan, so->currPos.nextPage, &itup, &offnum))
+ return false;
+ }
+ else if (so->skipData->curPos.nextAction == SkipStateStop)
+ {
+ /* we've reached the end of the index, or we cannot find any more keys */
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /* now find the next tuple */
+ match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+ if (!match)
+ {
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return false;
+ }
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+ }
+ else
+ {
+ if (--so->currPos.itemIndex < so->currPos.firstItem)
+ {
+ if (prefixDir != so->skipData->curPos.nextDirection)
+ {
+ so->skipData->curPos.nextAction = SkipStateNext;
+ so->skipData->curPos.nextDirection = prefixDir;
+ }
+
+ if (so->skipData->curPos.nextAction == SkipStateNext)
+ {
+ if (!_bt_step_back_page(scan, &itup, &offnum))
+ return false;
+ }
+ else if (so->skipData->curPos.nextAction == SkipStateStop)
+ {
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /* now find the next tuple */
+ match = _bt_skip_find_next(scan, itup, offnum, prefixDir, postfixDir);
+ if (!match)
+ {
+ if (_bt_skip_is_always_valid(so))
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ return false;
+ }
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
}
}
@@ -1517,8 +1334,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
*
* Returns true if any matching items found on the page, false if none.
*/
-static bool
-_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
+bool
+_bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber *offnum, bool isRegularMode)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Page page;
@@ -1528,6 +1345,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int itemIndex;
bool continuescan;
int indnatts;
+ int prefixskipindex;
/*
* We must have the buffer pinned and locked, but the usual macro can't be
@@ -1586,11 +1404,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* load items[] in ascending order */
itemIndex = 0;
- offnum = Max(offnum, minoff);
+ *offnum = Max(*offnum, minoff);
- while (offnum <= maxoff)
+ while (*offnum <= maxoff)
{
- ItemId iid = PageGetItemId(page, offnum);
+ ItemId iid = PageGetItemId(page, *offnum);
IndexTuple itup;
/*
@@ -1599,19 +1417,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
- offnum = OffsetNumberNext(offnum);
+ *offnum = OffsetNumberNext(*offnum);
continue;
}
itup = (IndexTuple) PageGetItem(page, iid);
- if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
+ if (_bt_checkkeys_extended(scan, itup, indnatts, dir, isRegularMode, &continuescan, &prefixskipindex))
{
/* tuple passes all scan key conditions */
if (!BTreeTupleIsPosting(itup))
{
/* Remember it */
- _bt_saveitem(so, itemIndex, offnum, itup);
+ _bt_saveitem(so, itemIndex, *offnum, itup);
itemIndex++;
}
else
@@ -1623,26 +1441,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
* TID
*/
tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
+ _bt_setuppostingitems(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, 0),
itup);
itemIndex++;
/* Remember additional TIDs */
for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
{
- _bt_savepostingitem(so, itemIndex, offnum,
+ _bt_savepostingitem(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, i),
tupleOffset);
itemIndex++;
}
}
}
+
+ *offnum = OffsetNumberNext(*offnum);
+
/* When !continuescan, there can't be any more matches, so stop */
if (!continuescan)
break;
-
- offnum = OffsetNumberNext(offnum);
+ if (!isRegularMode && prefixskipindex != -1)
+ break;
}
+ *offnum = OffsetNumberPrev(*offnum);
/*
* We don't need to visit page to the right when the high key
@@ -1662,7 +1484,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
int truncatt;
truncatt = BTreeTupleGetNAtts(itup, scan->indexRelation);
- _bt_checkkeys(scan, itup, truncatt, dir, &continuescan);
+ _bt_checkkeys(scan, itup, truncatt, dir, &continuescan, NULL);
}
if (!continuescan)
@@ -1678,11 +1500,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/* load items[] in descending order */
itemIndex = MaxTIDsPerBTreePage;
- offnum = Min(offnum, maxoff);
+ *offnum = Min(*offnum, maxoff);
- while (offnum >= minoff)
+ while (*offnum >= minoff)
{
- ItemId iid = PageGetItemId(page, offnum);
+ ItemId iid = PageGetItemId(page, *offnum);
IndexTuple itup;
bool tuple_alive;
bool passes_quals;
@@ -1699,10 +1521,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
{
- Assert(offnum >= P_FIRSTDATAKEY(opaque));
- if (offnum > P_FIRSTDATAKEY(opaque))
+ Assert(*offnum >= P_FIRSTDATAKEY(opaque));
+ if (*offnum > P_FIRSTDATAKEY(opaque))
{
- offnum = OffsetNumberPrev(offnum);
+ *offnum = OffsetNumberPrev(*offnum);
continue;
}
@@ -1713,8 +1535,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
itup = (IndexTuple) PageGetItem(page, iid);
- passes_quals = _bt_checkkeys(scan, itup, indnatts, dir,
- &continuescan);
+ passes_quals = _bt_checkkeys_extended(scan, itup, indnatts, dir,
+ isRegularMode, &continuescan, &prefixskipindex);
if (passes_quals && tuple_alive)
{
/* tuple passes all scan key conditions */
@@ -1722,7 +1544,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
{
/* Remember it */
itemIndex--;
- _bt_saveitem(so, itemIndex, offnum, itup);
+ _bt_saveitem(so, itemIndex, *offnum, itup);
}
else
{
@@ -1740,28 +1562,32 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
itemIndex--;
tupleOffset =
- _bt_setuppostingitems(so, itemIndex, offnum,
+ _bt_setuppostingitems(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, 0),
itup);
/* Remember additional TIDs */
for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
{
itemIndex--;
- _bt_savepostingitem(so, itemIndex, offnum,
+ _bt_savepostingitem(so, itemIndex, *offnum,
BTreeTupleGetPostingN(itup, i),
tupleOffset);
}
}
}
+
+ *offnum = OffsetNumberPrev(*offnum);
+
if (!continuescan)
{
/* there can't be any more matches, so stop */
so->currPos.moreLeft = false;
break;
}
-
- offnum = OffsetNumberPrev(offnum);
+ if (!isRegularMode && prefixskipindex != -1)
+ break;
}
+ *offnum = OffsetNumberNext(*offnum);
Assert(itemIndex >= 0);
so->currPos.firstItem = itemIndex;
@@ -1869,7 +1695,7 @@ _bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
* read lock, on that page. If we do not hold the pin, we set so->currPos.buf
* to InvalidBuffer. We return true to indicate success.
*/
-static bool
+bool
_bt_steppage(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1897,6 +1723,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
if (so->markTuples)
memcpy(so->markTuples, so->currTuples,
so->currPos.nextTupleOffset);
+ if (so->skipData)
+ memcpy(&so->skipData->markPos, &so->skipData->curPos,
+ sizeof(BTSkipPosData));
so->markPos.itemIndex = so->markItemIndex;
so->markItemIndex = -1;
}
@@ -1976,7 +1805,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
* If there are no more matching records in the given direction, we drop all
* locks and pins, set so->currPos.buf to InvalidBuffer, and return false.
*/
-static bool
+bool
_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
@@ -1984,6 +1813,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
Page page;
BTPageOpaque opaque;
bool status;
+ OffsetNumber offnum;
rel = scan->indexRelation;
@@ -2014,7 +1844,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ offnum = P_FIRSTDATAKEY(opaque);
+ if (_bt_readpage(scan, dir, &offnum, true))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2116,7 +1947,8 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ offnum = PageGetMaxOffsetNumber(page);
+ if (_bt_readpage(scan, dir, &offnum, true))
break;
}
else if (scan->parallel_scan != NULL)
@@ -2184,7 +2016,7 @@ _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
* to be half-dead; the caller should check that condition and step left
* again if it's important.
*/
-static Buffer
+Buffer
_bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot)
{
Page page;
@@ -2448,7 +2280,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
/*
* Now load data from the first page of the scan.
*/
- if (!_bt_readpage(scan, dir, start))
+ if (!_bt_readpage(scan, dir, &start, true))
{
/*
* There's no actually-matching data on this page. Try to advance to
@@ -2477,7 +2309,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
* for scan direction
*/
-static inline void
+inline void
_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
{
/* initialize moreLeft/moreRight appropriately for scan direction */
@@ -2494,3 +2326,25 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/* Forward the call to either _bt_checkkeys, which is the simplest
+ * and fastest way of checking keys, or to _bt_checkkeys_skip,
+ * which is a slower way to check the keys, but returns extra
+ * information about whether we should stop reading the current page
+ * and skip. The expensive check is only necessary when !isRegularMode, i.e.
+ * when prefixDir!=postfixDir, which only happens when a cursor scan reads backwards.
+ */
+static inline bool
+_bt_checkkeys_extended(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool isRegularMode,
+ bool *continuescan, int *prefixskipindex)
+{
+ if (isRegularMode)
+ {
+ return _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+ }
+ else
+ {
+ return _bt_checkkeys_skip(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+ }
+}
diff --git a/src/backend/access/nbtree/nbtskip.c b/src/backend/access/nbtree/nbtskip.c
new file mode 100644
index 0000000000..e2dbaf2e69
--- /dev/null
+++ b/src/backend/access/nbtree/nbtskip.c
@@ -0,0 +1,1455 @@
+/*-------------------------------------------------------------------------
+ *
+ * nbtskip.c
+ * Search code related to skip scan for postgres btrees.
+ *
+ *
+ * Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/nbtree/nbtskip.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/nbtree.h"
+#include "access/relscan.h"
+#include "catalog/catalog.h"
+#include "miscadmin.h"
+#include "utils/guc.h"
+#include "storage/predicate.h"
+#include "utils/lsyscache.h"
+#include "utils/rel.h"
+
+static inline void _bt_update_scankey_with_tuple(BTScanInsert scankeys,
+ Relation indexRel, IndexTuple itup, int numattrs);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, Buffer buf);
+static inline int32 _bt_compare_until(Relation rel, BTScanInsert key, IndexTuple itup, int prefix);
+static inline void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum,
+ OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+ ScanDirection postfixDir, int skipped, BTSkipState *nextAction);
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction);
+static inline void _bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs);
+static inline IndexTuple _bt_get_tuple_from_offset_with_copy(BTScanOpaque so, OffsetNumber curTupleOffnum);
+
+static void _bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+ ScanDirection prefixDir, ScanDirection postfixDir);
+static void _bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+ int prefix, IndexTuple itup, ScanDirection prefixDir);
+static bool _bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir);
+static void debug_print(IndexTuple itup, BTScanInsert scanKey, Relation rel, char *extra);
+
+/* probably to be removed but useful for debugging during patch implementation */
+static void debug_print(IndexTuple itup, BTScanInsert scanKey, Relation rel, char *extra)
+{
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+ char *lkey_desc = NULL;
+
+ /* Avoid infinite recursion -- don't instrument catalog indexes */
+ if (!IsCatalogRelation(rel))
+ {
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ Oid typOutput;
+ bool varlenatype;
+ char *val;
+ int i;
+
+ char buf[8096] = {0};
+ int idx = 0;
+
+ if (itup != NULL)
+ {
+ natts = BTreeTupleGetNAtts(itup, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(itup, itupdesc, values, isnull);
+
+ rel->rd_index->indnkeyatts = natts;
+
+ /*
+ * Since the regression tests should pass when the instrumentation
+ * patch is applied, be prepared for BuildIndexValueDescription() to
+ * return NULL due to security considerations.
+ */
+ lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ }
+
+ for (i = 0; i < scanKey->keysz; i++)
+ {
+ ScanKey cur = &scanKey->scankeys[i];
+
+ if (i != 0)
+ {
+ buf[idx] = ',';
+ idx++;
+ }
+
+ if (!(cur->sk_flags & SK_ISNULL))
+ {
+ if (cur->sk_subtype != InvalidOid)
+ getTypeOutputInfo(cur->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+ if (val)
+ {
+ unsigned long tocopy = strnlen(val, 15);
+ memcpy(buf + idx, val, tocopy);
+ idx += tocopy;
+ pfree(val);
+ }
+ else
+ {
+ memcpy(buf + idx, "n/a", 3);
+ idx += 3;
+ }
+ }
+ else
+ {
+ memcpy(buf + idx, "null", 4);
+ idx += 4;
+ }
+ }
+ buf[idx] = 0;
+
+ elog(DEBUG1, "%s : %s tuple(%s) sk(%s)",
+ extra, RelationGetRelationName(rel), lkey_desc ? lkey_desc : "N/A", buf);
+
+ /* Cleanup */
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
+ if (lkey_desc)
+ pfree(lkey_desc);
+ }
+}
+
+/*
+ * Returns whether the scan is still valid, i.e. whether we have not yet
+ * reached the end of the scan.
+ * The scan position can be invalid even though we should still continue
+ * the scan. This happens, for example, when we're scanning with
+ * prefixDir!=postfixDir: when looking at the first prefix, we traverse
+ * the items within the prefix from max to min. If none of them match, we
+ * actually run off the start of the index, meaning none of the tuples
+ * within this prefix match. The scan pos becomes invalid; however, we do
+ * need to look further at the next prefix. Therefore, this function still
+ * returns true in that particular case.
+ */
+static inline bool
+_bt_skip_is_valid(BTScanOpaque so, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ return BTScanPosIsValid(so->currPos) ||
+ (!_bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+ so->skipData->curPos.nextAction != SkipStateStop);
+}
+
+/* Try to find the next tuple to skip to within the local tuple storage.
+ * The local tuple storage is filled during _bt_readpage with all matching
+ * tuples on that page. If we can find the next prefix here, it saves us
+ * doing a scan from the root.
+ * Note that this optimization only works in regular mode
+ * (prefixDir == postfixDir). If that is not the case, the local tuple
+ * workspace will only ever contain tuples of one specific prefix
+ * (_bt_readpage will stop at the end of a prefix).
+ */
+static bool
+_bt_try_in_page_skip(IndexScanDesc scan, ScanDirection prefixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTScanPosItem *currItem;
+ BTSkip skip = so->skipData;
+ IndexTuple itup = NULL;
+ bool goback;
+ int low, high, starthigh, startlow;
+ int32 result,
+ cmpval;
+ BTScanInsert key = &so->skipData->curPos.skipScanKey;
+
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation, skip->prefix, itup, prefixDir);
+
+ _bt_set_bsearch_flags(key->scankeys[key->keysz - 1].sk_strategy, prefixDir, &key->nextkey, &goback);
+
+ /* Requesting nextkey semantics while using scantid seems nonsensical */
+ Assert(!key->nextkey || key->scantid == NULL);
+ /* scantid-set callers must use _bt_binsrch_insert() on leaf pages */
+
+ startlow = low = ScanDirectionIsForward(prefixDir) ? so->currPos.itemIndex + 1 : so->currPos.firstItem;
+ starthigh = high = ScanDirectionIsForward(prefixDir) ? so->currPos.lastItem : so->currPos.itemIndex - 1;
+
+ /*
+ * If there are no items in the search range, there is nothing to skip to
+ * within the local tuple storage, so give up and let the caller perform a
+ * regular skip.
+ */
+ if (unlikely(high < low))
+ return false;
+
+ /*
+ * Binary search to find the first stored item >= scan key, or first
+ * item > scankey when nextkey is true.
+ *
+ * For nextkey=false (cmpval=1), the loop invariant is: all slots before
+ * 'low' are < scan key, all slots at or after 'high' are >= scan key.
+ *
+ * For nextkey=true (cmpval=0), the loop invariant is: all slots before
+ * 'low' are <= scan key, all slots at or after 'high' are > scan key.
+ *
+ * We can fall out when high == low.
+ */
+ high++; /* establish the loop invariant for high */
+
+ cmpval = key->nextkey ? 0 : 1; /* select comparison value */
+
+ while (high > low)
+ {
+ int mid = low + ((high - low) / 2);
+
+ /* We have low <= mid < high, so mid points at a real slot */
+
+ currItem = &so->currPos.items[mid];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ result = _bt_compare_until(scan->indexRelation, key, itup, skip->prefix);
+
+ if (result >= cmpval)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ if (high > starthigh)
+ return false;
+
+ if (goback)
+ {
+ low--;
+ if (low < startlow)
+ return false;
+ }
+
+ so->currPos.itemIndex = low;
+
+ if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+ {
+ currItem = &so->currPos.items[low];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ debug_print(itup, &so->skipData->curPos.skipScanKey, scan->indexRelation, "skip-in-page");
+ }
+
+ return true;
+}
+
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * in: pinned, not locked
+ * out: pinned, not locked (unless end of scan, then unpinned)
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTScanPosItem *currItem;
+ IndexTuple itup = NULL;
+ OffsetNumber curTupleOffnum = InvalidOffsetNumber;
+ BTSkipCompareResult cmp;
+ BTSkip skip = so->skipData;
+ OffsetNumber first;
+
+ /* in page skip only works when prefixDir == postfixDir */
+ if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) || !_bt_try_in_page_skip(scan, prefixDir))
+ {
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ so->skipData->curPos.nextSkipIndex = so->skipData->prefix;
+ _bt_skip_once(scan, &itup, &curTupleOffnum, true, prefixDir, postfixDir);
+ _bt_skip_until_match(scan, &itup, &curTupleOffnum, prefixDir, postfixDir);
+ if (!_bt_skip_is_always_valid(so))
+ return false;
+
+ first = curTupleOffnum;
+ _bt_readpage(scan, postfixDir, &curTupleOffnum, _bt_skip_is_regular_mode(prefixDir, postfixDir));
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ {
+ print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+ "first item on page compared after skip");
+ print_itup(BufferGetBlockNumber(so->currPos.buf), _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+ "last item on page compared after skip");
+ }
+ _bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+ _bt_determine_next_action(scan, &cmp, first, curTupleOffnum, postfixDir, &skip->curPos.nextAction);
+ skip->curPos.nextDirection = prefixDir;
+ skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+ _bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum), prefixDir, postfixDir);
+
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ }
+
+ /* prepare for the call to _bt_next, because _bt_next increments this to get to the tuple we want to be at */
+ if (ScanDirectionIsForward(postfixDir))
+ so->currPos.itemIndex--;
+ else
+ so->currPos.itemIndex++;
+
+ return true;
+}
+
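+/*
+ * Return the index tuple at the given offset on the page currently pinned
+ * in so->currPos.buf.
+ */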
+IndexTuple
+_bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+ Page page = BufferGetPage(so->currPos.buf);
+ return (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+}
+
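+/*
+ * Like _bt_get_tuple_from_offset, but copy the tuple into the skip scan's
+ * private skipTuple buffer and return that copy, so it can still be used
+ * once the page is no longer locked.
+ */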
+static IndexTuple
+_bt_get_tuple_from_offset_with_copy(BTScanOpaque so, OffsetNumber curTupleOffnum)
+{
+ Page page = BufferGetPage(so->currPos.buf);
+ IndexTuple itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, curTupleOffnum));
+ Size itupsz = IndexTupleSize(itup);
+ memcpy(so->skipData->curPos.skipTuple, itup, itupsz);
+
+ return (IndexTuple) so->skipData->curPos.skipTuple;
+}
+
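+/*
+ * Decide what the skip scan should do next (stop, step to the next page,
+ * skip to the next prefix, or skip within the prefix using extra quals),
+ * based on how the last tuple examined on the page compared against the
+ * scan keys.
+ */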
+static void
+_bt_determine_next_action(IndexScanDesc scan, BTSkipCompareResult *cmp, OffsetNumber firstOffnum, OffsetNumber lastOffnum, ScanDirection postfixDir, BTSkipState *nextAction)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ if (cmp->fullKeySkip)
+ *nextAction = SkipStateStop;
+ else if (ScanDirectionIsForward(postfixDir))
+ {
+ OffsetNumber firstItem = firstOffnum, lastItem = lastOffnum;
+ if (cmp->prefixSkip)
+ {
+ *nextAction = SkipStateSkip;
+ }
+ else
+ {
+ IndexTuple toCmp;
+ if (so->currPos.lastItem >= so->currPos.firstItem)
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, so->currPos.items[so->currPos.lastItem].indexOffset);
+ else
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, firstItem);
+ _bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+ scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+ (cmp->prefixCmpResult != 0 ||
+ _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+ _bt_get_tuple_from_offset(so, lastItem), so->skipData->prefix) != 0))
+ *nextAction = SkipStateSkipExtra;
+ else
+ *nextAction = SkipStateNext;
+ }
+ }
+ else
+ {
+ OffsetNumber firstItem = lastOffnum, lastItem = firstOffnum;
+ if (cmp->prefixSkip)
+ {
+ *nextAction = SkipStateSkip;
+ }
+ else
+ {
+ IndexTuple toCmp;
+ if (so->currPos.lastItem >= so->currPos.firstItem)
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, so->currPos.items[so->currPos.firstItem].indexOffset);
+ else
+ toCmp = _bt_get_tuple_from_offset_with_copy(so, lastItem);
+ _bt_update_scankey_with_tuple(&so->skipData->currentTupleKey,
+ scan->indexRelation, toCmp, RelationGetNumberOfAttributes(scan->indexRelation));
+ if (_bt_has_extra_quals_after_skip(so->skipData, postfixDir, so->skipData->prefix) && !cmp->equal &&
+ (cmp->prefixCmpResult != 0 ||
+ _bt_compare_until(scan->indexRelation, &so->skipData->currentTupleKey,
+ _bt_get_tuple_from_offset(so, firstItem), so->skipData->prefix) != 0))
+ *nextAction = SkipStateSkipExtra;
+ else
+ *nextAction = SkipStateNext;
+ }
+ }
+}
+
+static inline bool
+_bt_should_prefix_skip(BTSkipCompareResult *cmp)
+{
+ return cmp->prefixSkip || cmp->prefixCmpResult != 0;
+}
+
+static inline void
+_bt_determine_next_action_after_skip(BTScanOpaque so, BTSkipCompareResult *cmp, ScanDirection prefixDir,
+ ScanDirection postfixDir, int skipped, BTSkipState *nextAction)
+{
+ if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+ *nextAction = SkipStateStop;
+ else if (cmp->equal && _bt_skip_is_regular_mode(prefixDir, postfixDir))
+ *nextAction = SkipStateNext;
+ else if (_bt_should_prefix_skip(cmp) && _bt_skip_is_regular_mode(prefixDir, postfixDir) &&
+ ((ScanDirectionIsForward(prefixDir) && cmp->skCmpResult == -1) ||
+ (ScanDirectionIsBackward(prefixDir) && cmp->skCmpResult == 1)))
+ *nextAction = SkipStateSkip;
+ else if (!_bt_skip_is_regular_mode(prefixDir, postfixDir) ||
+ _bt_has_extra_quals_after_skip(so->skipData, postfixDir, skipped) ||
+ cmp->prefixCmpResult != 0)
+ *nextAction = SkipStateSkipExtra;
+ else
+ *nextAction = SkipStateNext;
+}
+
+static inline void
+_bt_determine_next_action_after_skip_extra(BTScanOpaque so, BTSkipCompareResult *cmp, BTSkipState *nextAction)
+{
+ if (!_bt_skip_is_always_valid(so) || cmp->fullKeySkip)
+ *nextAction = SkipStateStop;
+ else if (cmp->equal)
+ *nextAction = SkipStateNext;
+ else if (_bt_should_prefix_skip(cmp))
+ *nextAction = SkipStateSkip;
+ else
+ *nextAction = SkipStateNext;
+}
+
+/* just a debug function that prints a scankey. will be removed for final patch */
+static inline void
+_print_skey(IndexScanDesc scan, BTScanInsert scanKey)
+{
+ Oid typOutput;
+ bool varlenatype;
+ char *val;
+ int i;
+ Relation rel = scan->indexRelation;
+
+ for (i = 0; i < scanKey->keysz; i++)
+ {
+ ScanKey cur = &scanKey->scankeys[i];
+ if (!IsCatalogRelation(rel))
+ {
+ if (!(cur->sk_flags & SK_ISNULL))
+ {
+ if (cur->sk_subtype != InvalidOid)
+ getTypeOutputInfo(cur->sk_subtype,
+ &typOutput, &varlenatype);
+ else
+ getTypeOutputInfo(rel->rd_opcintype[i],
+ &typOutput, &varlenatype);
+ val = OidOutputFunctionCall(typOutput, cur->sk_argument);
+ if (val)
+ {
+ elog(DEBUG1, "%s sk attr %d val: %s (%s, %s)",
+ RelationGetRelationName(rel), i, val,
+ (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ pfree(val);
+ }
+ }
+ else
+ {
+ elog(DEBUG1, "%s sk attr %d val: NULL (%s, %s)",
+ RelationGetRelationName(rel), i,
+ (cur->sk_flags & SK_BT_NULLS_FIRST) != 0 ? "NULLS FIRST" : "NULLS LAST",
+ (cur->sk_flags & SK_BT_DESC) != 0 ? "DESC" : "ASC");
+ }
+ }
+ }
+}
+
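+/*
+ * Variant of _bt_checkkeys used when !isRegularMode: besides checking the
+ * scan keys, it reports via *prefixskipindex when the tuple no longer
+ * belongs to the current prefix, so the caller knows to stop reading and
+ * skip instead.
+ */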
+bool
+_bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixskipindex)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+
+ bool match = _bt_checkkeys(scan, tuple, tupnatts, dir, continuescan, prefixskipindex);
+ int prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+ if (*prefixskipindex == -1 && prefixCmpResult != 0)
+ {
+ *prefixskipindex = skip->prefix;
+ return false;
+ }
+ else
+ {
+ bool newcont;
+ _bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &newcont, prefixskipindex);
+ if (*prefixskipindex == -1 && prefixCmpResult != 0)
+ {
+ *prefixskipindex = skip->prefix;
+ return false;
+ }
+ }
+ return match;
+}
+
+/*
+ * Compare a scankey with a given tuple, looking only at the first 'prefix'
+ * columns. This function returns 0 if the first 'prefix' columns are equal,
+ * -1 if key < itup on the first 'prefix' columns,
+ * 1 if key > itup on the first 'prefix' columns.
+ */
+int32
+_bt_compare_until(Relation rel,
+ BTScanInsert key,
+ IndexTuple itup,
+ int prefix)
+{
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ ScanKey scankey;
+ int ncmpkey;
+
+ Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
+
+ ncmpkey = Min(prefix, key->keysz);
+ scankey = key->scankeys;
+ for (int i = 1; i <= ncmpkey; i++)
+ {
+ Datum datum;
+ bool isNull;
+ int32 result;
+
+ datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
+
+ /* see comments about NULLs handling in btbuild */
+ if (scankey->sk_flags & SK_ISNULL) /* key is NULL */
+ {
+ if (isNull)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum,
+ scankey->sk_argument));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+
+ /* if the keys are unequal, return the difference */
+ if (result != 0)
+ return result;
+
+ scankey++;
+ }
+ return 0;
+}
+
+
+/*
+ * Create the initial scankeys for skipping and store them in the skipData
+ * structure.
+ */
+void
+_bt_skip_create_scankeys(Relation rel, BTScanOpaque so)
+{
+ int keysCount;
+ BTSkip skip = so->skipData;
+ StrategyNumber stratTotal;
+ ScanKey keyPointers[INDEX_MAX_KEYS];
+ bool goback;
+ /* we need to create both forward and backward keys because the scan direction
+ * may change at any moment in scans with a cursor.
+ * we could technically delay creation of the second until first use as an optimization
+ * but that is not implemented yet.
+ */
+ keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, ForwardScanDirection,
+ keyPointers, skip->fwdNotNullKeys, &stratTotal, skip->prefix);
+ _bt_create_insertion_scan_key(rel, ForwardScanDirection, keyPointers, keysCount,
+ &skip->fwdScanKey, &stratTotal, &goback);
+
+ keysCount = _bt_choose_scan_keys(so->keyData, so->numberOfKeys, BackwardScanDirection,
+ keyPointers, skip->bwdNotNullKeys, &stratTotal, skip->prefix);
+ _bt_create_insertion_scan_key(rel, BackwardScanDirection, keyPointers, keysCount,
+ &skip->bwdScanKey, &stratTotal, &goback);
+
+ _bt_metaversion(rel, &skip->curPos.skipScanKey.heapkeyspace,
+ &skip->curPos.skipScanKey.allequalimage);
+ skip->curPos.skipScanKey.anynullkeys = false; /* unused */
+ skip->curPos.skipScanKey.nextkey = false;
+ skip->curPos.skipScanKey.pivotsearch = false;
+ skip->curPos.skipScanKey.scantid = NULL;
+ skip->curPos.skipScanKey.keysz = 0;
+
+ /* Set up a scankey for the current tuple as well. We won't necessarily
+ * use the data from the current tuple right away, but we need the rest
+ * of the data structure to be set up correctly for when we use it to
+ * create the skip->curPos.skipScanKey keys later.
+ */
+ _bt_mkscankey(rel, NULL, &skip->currentTupleKey);
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within a page, specified by the buffer.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key,
+ Buffer buf)
+{
+ /* @todo: optimization is still possible here to
+ * only check either the low or the high, depending on
+ * which direction *we came from* AND which direction
+ * *we are planning to scan*
+ */
+ OffsetNumber low, high;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int ans_lo, ans_hi;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (unlikely(high < low))
+ return false;
+
+ ans_lo = _bt_compare(scan->indexRelation,
+ key, page, low);
+ ans_hi = _bt_compare(scan->indexRelation,
+ key, page, high);
+ if (key->nextkey)
+ {
+ /* sk < last && sk >= first */
+ return ans_lo >= 0 && ans_hi == -1;
+ }
+ else
+ {
+ /* sk <= last && sk > first */
+ return ans_lo == 1 && ans_hi <= 0;
+ }
+}
+
+/* in: pinned and locked, out: pinned and locked (unless end of scan) */
+static void
+_bt_skip_find(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ BTScanInsert scanKey, ScanDirection dir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ OffsetNumber offnum;
+ BTStack stack;
+ Buffer buf;
+ bool goback;
+ Page page;
+ BTPageOpaque opaque;
+ OffsetNumber minoff;
+ Relation rel = scan->indexRelation;
+ bool fromroot = true;
+
+ _bt_set_bsearch_flags(scanKey->scankeys[scanKey->keysz - 1].sk_strategy, dir, &scanKey->nextkey, &goback);
+
+ if ((DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages) && !IsCatalogRelation(rel))
+ {
+ if (*curTuple != NULL)
+ print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+ "before btree search");
+
+ elog(DEBUG1, "%s searching tree with %d keys, nextkey=%d, goback=%d",
+ RelationGetRelationName(rel), scanKey->keysz, scanKey->nextkey,
+ goback);
+
+ _print_skey(scan, scanKey);
+ }
+
+ if (*curTupleOffnum == InvalidOffsetNumber)
+ {
+ BTScanPosUnpinIfPinned(so->currPos);
+ }
+ else
+ {
+ if (_bt_scankey_within_page(scan, scanKey, so->currPos.buf))
+ {
+ elog(DEBUG1, "sk found within current page");
+
+ offnum = _bt_binsrch(scan->indexRelation, scanKey, so->currPos.buf);
+ fromroot = false;
+ }
+ else
+ {
+ _bt_unlockbuf(rel, so->currPos.buf);
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ if (fromroot)
+ {
+ stack = _bt_search(scan->indexRelation, scanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+
+ offnum = _bt_binsrch(scan->indexRelation, scanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+ }
+
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ if (goback)
+ {
+ offnum = OffsetNumberPrev(offnum);
+ minoff = P_FIRSTDATAKEY(opaque);
+ if (offnum < minoff)
+ {
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_back_page(scan, curTuple, curTupleOffnum))
+ return;
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ offnum = PageGetMaxOffsetNumber(page);
+ }
+ }
+ else if (offnum > PageGetMaxOffsetNumber(page))
+ {
+ BlockNumber next = opaque->btpo_next;
+ _bt_unlockbuf(rel, so->currPos.buf);
+ if (!_bt_step_forward_page(scan, next, curTuple, curTupleOffnum))
+ return;
+ page = BufferGetPage(so->currPos.buf);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ offnum = P_FIRSTDATAKEY(opaque);
+ }
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ *curTupleOffnum = offnum;
+ *curTuple = _bt_get_tuple_from_offset(so, offnum);
+ so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ print_itup(BufferGetBlockNumber(so->currPos.buf), *curTuple, NULL, rel,
+ "after btree search");
+}
+
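+/* Step one page forwards or backwards, depending on the scan direction. */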
+static inline bool
+_bt_step_one_page(IndexScanDesc scan, ScanDirection dir, IndexTuple *curTuple,
+ OffsetNumber *curTupleOffnum)
+{
+ if (ScanDirectionIsForward(dir))
+ {
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ return _bt_step_forward_page(scan, so->currPos.nextPage, curTuple, curTupleOffnum);
+ }
+ else
+ {
+ return _bt_step_back_page(scan, curTuple, curTupleOffnum);
+ }
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+ OffsetNumber *curTupleOffnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ Relation rel = scan->indexRelation;
+ BlockNumber blkno = next;
+ Page page;
+ BTPageOpaque opaque;
+
+ Assert(BTScanPosIsValid(so->currPos));
+
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ /*
+ * Before we modify currPos, make a copy of the page data if there was a
+ * mark position that needs it.
+ */
+ if (so->markItemIndex >= 0)
+ {
+ /* bump pin on current buffer for assignment to mark buffer */
+ if (BTScanPosIsPinned(so->currPos))
+ IncrBufferRefCount(so->currPos.buf);
+ memcpy(&so->markPos, &so->currPos,
+ offsetof(BTScanPosData, items[1]) +
+ so->currPos.lastItem * sizeof(BTScanPosItem));
+ if (so->markTuples)
+ memcpy(so->markTuples, so->currTuples,
+ so->currPos.nextTupleOffset);
+ so->markPos.itemIndex = so->markItemIndex;
+ if (so->skipData)
+ memcpy(&so->skipData->markPos, &so->skipData->curPos,
+ sizeof(BTSkipPosData));
+ so->markItemIndex = -1;
+ }
+
+ /* Remember we left a page with data */
+ so->currPos.moreLeft = true;
+
+ /* release the previous buffer, if pinned */
+ BTScanPosUnpinIfPinned(so->currPos);
+
+ {
+ for (;;)
+ {
+ /*
+ * if we're at end of scan, give up and mark parallel scan as
+ * done, so that all the workers can finish their scan
+ */
+ if (blkno == P_NONE)
+ {
+ _bt_parallel_done(scan);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /* check for interrupts while we're not holding any buffer lock */
+ CHECK_FOR_INTERRUPTS();
+ /* step right one page */
+ so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+ page = BufferGetPage(so->currPos.buf);
+ TestForOldSnapshot(scan->xs_snapshot, rel, page);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ /* check for deleted page */
+ if (!P_IGNORE(opaque))
+ {
+ PredicateLockPage(rel, blkno, scan->xs_snapshot);
+ *curTupleOffnum = P_FIRSTDATAKEY(opaque);
+ *curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+ break;
+ }
+
+ blkno = opaque->btpo_next;
+ _bt_relbuf(rel, so->currPos.buf);
+ }
+ }
+
+ return true;
+}
+
+/* in: possibly pinned, but unlocked, out: pinned and locked */
+bool
+_bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ Assert(BTScanPosIsValid(so->currPos));
+
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ /*
+ * Before we modify currPos, make a copy of the page data if there was a
+ * mark position that needs it.
+ */
+ if (so->markItemIndex >= 0)
+ {
+ /* bump pin on current buffer for assignment to mark buffer */
+ if (BTScanPosIsPinned(so->currPos))
+ IncrBufferRefCount(so->currPos.buf);
+ memcpy(&so->markPos, &so->currPos,
+ offsetof(BTScanPosData, items[1]) +
+ so->currPos.lastItem * sizeof(BTScanPosItem));
+ if (so->markTuples)
+ memcpy(so->markTuples, so->currTuples,
+ so->currPos.nextTupleOffset);
+ if (so->skipData)
+ memcpy(&so->skipData->markPos, &so->skipData->curPos,
+ sizeof(BTSkipPosData));
+ so->markPos.itemIndex = so->markItemIndex;
+ so->markItemIndex = -1;
+ }
+
+ /* Remember we left a page with data */
+ so->currPos.moreRight = true;
+
+ /* Not parallel, so just use our own notion of the current page */
+
+ {
+ Relation rel;
+ Page page;
+ BTPageOpaque opaque;
+
+ rel = scan->indexRelation;
+
+ if (BTScanPosIsPinned(so->currPos))
+ _bt_lockbuf(rel, so->currPos.buf, BT_READ);
+ else
+ so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+
+ for (;;)
+ {
+ /* Step to next physical page */
+ so->currPos.buf = _bt_walk_left(rel, so->currPos.buf,
+ scan->xs_snapshot);
+
+ /* if we're physically at end of index, return failure */
+ if (so->currPos.buf == InvalidBuffer)
+ {
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+
+ /*
+ * Okay, we managed to move left to a non-deleted page. Done if
+ * it's not half-dead and contains matching tuples. Else loop back
+ * and do it all again.
+ */
+ page = BufferGetPage(so->currPos.buf);
+ TestForOldSnapshot(scan->xs_snapshot, rel, page);
+ opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ if (!P_IGNORE(opaque))
+ {
+ PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
+ *curTupleOffnum = PageGetMaxOffsetNumber(page);
+ *curTuple = _bt_get_tuple_from_offset(so, *curTupleOffnum);
+ break;
+ }
+ }
+ }
+
+ return true;
+}
+
+/* holds lock as long as curTupleOffnum != InvalidOffsetNumber */
+bool
+_bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ BTSkipCompareResult cmp;
+
+ while (_bt_skip_is_valid(so, prefixDir, postfixDir))
+ {
+ bool found;
+ _bt_skip_until_match(scan, &curTuple, &curTupleOffnum, prefixDir, postfixDir);
+
+ while (_bt_skip_is_always_valid(so))
+ {
+ OffsetNumber first = curTupleOffnum;
+ found = _bt_readpage(scan, postfixDir, &curTupleOffnum,
+ _bt_skip_is_regular_mode(prefixDir, postfixDir));
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ {
+ print_itup(BufferGetBlockNumber(so->currPos.buf),
+ _bt_get_tuple_from_offset(so, first), NULL, scan->indexRelation,
+ "first item on page compared");
+ print_itup(BufferGetBlockNumber(so->currPos.buf),
+ _bt_get_tuple_from_offset(so, curTupleOffnum), NULL, scan->indexRelation,
+ "last item on page compared");
+ }
+ _bt_compare_current_item(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+ _bt_determine_next_action(scan, &cmp, first, curTupleOffnum,
+ postfixDir, &skip->curPos.nextAction);
+ skip->curPos.nextDirection = prefixDir;
+ skip->curPos.nextSkipIndex = cmp.prefixSkipIndex;
+
+ if (found)
+ {
+ _bt_skip_update_scankey_after_read(scan, _bt_get_tuple_from_offset(so, curTupleOffnum),
+ prefixDir, postfixDir);
+ return true;
+ }
+ else if (skip->curPos.nextAction == SkipStateNext)
+ {
+ if (curTupleOffnum != InvalidOffsetNumber)
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ if (!_bt_step_one_page(scan, postfixDir, &curTuple, &curTupleOffnum))
+ return false;
+ }
+ else if (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra)
+ {
+ curTuple = _bt_get_tuple_from_offset(so, curTupleOffnum);
+ _bt_skip_update_scankey_after_read(scan, curTuple, prefixDir, postfixDir);
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ curTupleOffnum = InvalidOffsetNumber;
+ curTuple = NULL;
+ break;
+ }
+ else if (skip->curPos.nextAction == SkipStateStop)
+ {
+ _bt_unlockbuf(scan->indexRelation, so->currPos.buf);
+ BTScanPosUnpinIfPinned(so->currPos);
+ BTScanPosInvalidate(so->currPos);
+ return false;
+ }
+ else
+ {
+ Assert(false);
+ }
+ }
+ }
+ return false;
+}
+
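+/*
+ * Keep skipping for as long as the current state says that another skip
+ * (regular or extra-qual) is required, or until the scan is exhausted.
+ */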
+void
+_bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ while (_bt_skip_is_valid(so, prefixDir, postfixDir) &&
+ (skip->curPos.nextAction == SkipStateSkip || skip->curPos.nextAction == SkipStateSkipExtra))
+ {
+ _bt_skip_once(scan, curTuple, curTupleOffnum,
+ skip->curPos.nextAction == SkipStateSkip, prefixDir, postfixDir);
+ }
+}
+
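+/*
+ * Compare the given tuple against the scan keys and the current skip
+ * prefix, filling *cmp with the results needed to decide the next skip
+ * action. If the scan position is no longer valid, *cmp is filled with
+ * values that stop the scan (or, when !isRegularMode, move it on to the
+ * next prefix).
+ */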
+void
+_bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts, ScanDirection dir,
+ bool isRegularMode, BTSkipCompareResult* cmp)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+
+ if (_bt_skip_is_always_valid(so))
+ {
+ bool continuescan = true;
+
+ cmp->equal = _bt_checkkeys(scan, tuple, tupnatts, dir, &continuescan, &cmp->prefixSkipIndex);
+ cmp->fullKeySkip = !continuescan;
+ /* prefix can be smaller than scankey due to extra quals being added
+ * therefore we need to compare both. @todo this can be optimized into one function call */
+ cmp->prefixCmpResult = _bt_compare_until(scan->indexRelation, &skip->curPos.skipScanKey, tuple, skip->prefix);
+ cmp->skCmpResult = _bt_compare_until(scan->indexRelation,
+ &skip->curPos.skipScanKey, tuple, skip->curPos.skipScanKey.keysz);
+ if (cmp->prefixSkipIndex == -1)
+ {
+ if (isRegularMode)
+ {
+ cmp->prefixSkip = false;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ else
+ {
+ cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ }
+ else
+ {
+ int newskip = -1;
+ _bt_checkkeys_threeway(scan, tuple, tupnatts, dir, &continuescan, &newskip);
+ if (newskip != -1)
+ {
+ cmp->prefixSkip = true;
+ cmp->prefixSkipIndex = newskip;
+ }
+ else
+ {
+ if (isRegularMode)
+ {
+ cmp->prefixSkip = false;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ else
+ {
+ cmp->prefixSkip = ScanDirectionIsForward(dir) ? cmp->prefixCmpResult < 0 : cmp->prefixCmpResult > 0;
+ cmp->prefixSkipIndex = skip->prefix;
+ }
+ }
+ }
+
+ if (DEBUG2 >= log_min_messages || DEBUG2 >= client_min_messages)
+ {
+ print_itup(BufferGetBlockNumber(so->currPos.buf), tuple, NULL, scan->indexRelation,
+ "compare item");
+ _print_skey(scan, &skip->curPos.skipScanKey);
+ elog(DEBUG1, "result: eq: %d fkskip: %d pfxskip: %d prefixcmpres: %d prefixskipidx: %d", cmp->equal, cmp->fullKeySkip,
+ _bt_should_prefix_skip(cmp), cmp->prefixCmpResult, cmp->prefixSkipIndex);
+ }
+ }
+ else
+ {
+ /* we cannot stop the scan if !isRegularMode - then we do need to skip to the next prefix */
+ cmp->fullKeySkip = isRegularMode;
+ cmp->equal = false;
+ cmp->prefixCmpResult = -2;
+ cmp->prefixSkip = true;
+ cmp->prefixSkipIndex = skip->prefix;
+ cmp->skCmpResult = -2;
+ }
+}
+
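+/*
+ * Perform one skip to the next prefix: search the tree for the prefix
+ * boundary, re-evaluate the tuple we land on, and repeat while that tuple
+ * tells us to skip again; finally apply the extra conditions when needed.
+ */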
+void
+_bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ BTSkipCompareResult cmp;
+ bool doskip = forceSkip;
+ int skipIndex = skip->curPos.nextSkipIndex;
+ skip->curPos.nextAction = SkipStateSkipExtra;
+
+ while (doskip)
+ {
+ int toskip = skipIndex;
+ if (*curTuple != NULL)
+ {
+ if (skip->prefix <= skipIndex || !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ {
+ toskip = skip->prefix;
+ }
+
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+ toskip, *curTuple, prefixDir);
+ }
+
+ if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+ {
+ debug_print(*curTuple, &so->skipData->curPos.skipScanKey, scan->indexRelation, "skip");
+ }
+
+ _bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, prefixDir);
+
+ if (_bt_skip_is_always_valid(so))
+ {
+ _bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+ prefixDir, prefixDir, true, *curTuple);
+ _bt_compare_current_item(scan, *curTuple,
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ prefixDir,
+ _bt_skip_is_regular_mode(prefixDir, postfixDir), &cmp);
+ skipIndex = cmp.prefixSkipIndex;
+ _bt_determine_next_action_after_skip(so, &cmp, prefixDir,
+ postfixDir, toskip, &skip->curPos.nextAction);
+ }
+ else
+ {
+ skip->curPos.nextAction = SkipStateStop;
+ }
+ doskip = skip->curPos.nextAction == SkipStateSkip;
+ }
+ if (skip->curPos.nextAction != SkipStateStop && skip->curPos.nextAction != SkipStateNext)
+ _bt_skip_extra_conditions(scan, curTuple, curTupleOffnum, prefixDir, postfixDir, &cmp);
+}
+
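+/*
+ * Having just arrived at a new prefix, use the remaining ('extra') quals to
+ * search within that prefix for the first tuple that could match, and then
+ * determine the next action.
+ */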
+void
+_bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ bool regularMode = _bt_skip_is_regular_mode(prefixDir, postfixDir);
+ if (_bt_skip_is_always_valid(so))
+ {
+ do
+ {
+ if (*curTuple != NULL)
+ _bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+ postfixDir, prefixDir, false, *curTuple);
+ if (DEBUG1 >= log_min_messages || DEBUG1 >= client_min_messages)
+ {
+ debug_print(*curTuple, &so->skipData->curPos.skipScanKey, scan->indexRelation, "skip-extra");
+ }
+ _bt_skip_find(scan, curTuple, curTupleOffnum, &skip->curPos.skipScanKey, postfixDir);
+ _bt_compare_current_item(scan, *curTuple,
+ IndexRelationGetNumberOfAttributes(scan->indexRelation),
+ postfixDir, _bt_skip_is_regular_mode(prefixDir, postfixDir), cmp);
+ } while (regularMode && cmp->prefixCmpResult != 0 && !cmp->equal && !cmp->fullKeySkip);
+ skip->curPos.nextSkipIndex = cmp->prefixSkipIndex;
+ }
+ _bt_determine_next_action_after_skip_extra(so, cmp, &skip->curPos.nextAction);
+}
+
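+/*
+ * Refresh the skip scan key after a page has been read, so that the next
+ * skip starts from the right place: either at the next prefix, or at a
+ * position within the current prefix determined by the extra quals.
+ */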
+static void
+_bt_skip_update_scankey_after_read(IndexScanDesc scan, IndexTuple curTuple,
+ ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ if (skip->curPos.nextAction == SkipStateSkip)
+ {
+ int toskip = skip->curPos.nextSkipIndex;
+ if (skip->prefix <= skip->curPos.nextSkipIndex ||
+ !_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ {
+ toskip = skip->prefix;
+ }
+
+ if (_bt_skip_is_regular_mode(prefixDir, postfixDir))
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+ toskip, curTuple, prefixDir);
+ else
+ _bt_skip_update_scankey_for_prefix_skip(scan, scan->indexRelation,
+ toskip, NULL, prefixDir);
+ }
+ else if (skip->curPos.nextAction == SkipStateSkipExtra)
+ {
+ _bt_skip_update_scankey_for_extra_skip(scan, scan->indexRelation,
+ postfixDir, prefixDir, false, curTuple);
+ }
+}
+
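+/*
+ * Compare a single scankey argument against one index attribute value,
+ * using the same NULL ordering and DESC handling as _bt_compare.
+ */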
+static inline int
+_bt_compare_one(ScanKey scankey, Datum datum2, bool isNull2)
+{
+ int32 result;
+ Datum datum1 = scankey->sk_argument;
+ bool isNull1 = scankey->sk_flags & SK_ISNULL;
+ /* see comments about NULLs handling in btbuild */
+ if (isNull1) /* key is NULL */
+ {
+ if (isNull2)
+ result = 0; /* NULL "=" NULL */
+ else if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = -1; /* NULL "<" NOT_NULL */
+ else
+ result = 1; /* NULL ">" NOT_NULL */
+ }
+ else if (isNull2) /* key is NOT_NULL and item is NULL */
+ {
+ if (scankey->sk_flags & SK_BT_NULLS_FIRST)
+ result = 1; /* NOT_NULL ">" NULL */
+ else
+ result = -1; /* NOT_NULL "<" NULL */
+ }
+ else
+ {
+ /*
+ * The sk_func needs to be passed the index value as left arg and
+ * the sk_argument as right arg (they might be of different
+ * types). Since it is convenient for callers to think of
+ * _bt_compare as comparing the scankey to the index item, we have
+ * to flip the sign of the comparison result. (Unless it's a DESC
+ * column, in which case we *don't* flip the sign.)
+ */
+ result = DatumGetInt32(FunctionCall2Coll(&scankey->sk_func,
+ scankey->sk_collation,
+ datum2,
+ datum1));
+
+ if (!(scankey->sk_flags & SK_BT_DESC))
+ INVERT_COMPARE_RESULT(result);
+ }
+ return result;
+}
+
+/*
+ * set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_scankey_with_tuple(BTScanInsert insertKey, Relation indexRel, IndexTuple itup, int numattrs)
+{
+ TupleDesc itupdesc;
+ int i;
+ ScanKey scankeys = insertKey->scankeys;
+
+ insertKey->keysz = numattrs;
+ itupdesc = RelationGetDescr(indexRel);
+ for (i = 0; i < numattrs; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/* copy the elements important to a skip from one insertion sk to another */
+static inline void
+_bt_copy_scankey(BTScanInsert to, BTScanInsert from, int numattrs)
+{
+ memcpy(to->scankeys, from->scankeys, sizeof(ScanKeyData) * (unsigned long)numattrs);
+ to->nextkey = from->nextkey;
+ to->keysz = numattrs;
+}
+
+/*
+ * Updates the existing scankey for skipping to the next prefix.
+ * The 'prefix' argument determines how many attributes the scankey will
+ * have: usually skip->prefix, but it can be less, as determined by the
+ * comparison result with the current tuple.
+ * For example, take SELECT * FROM tbl WHERE b<2, an index on (a,b,c), and
+ * skipping with a prefix of size 2: if we encounter the tuple (1,3,1), it
+ * does not match the qual b<2. However, we also know that it is not useful
+ * to skip to any next value with prefix=2 (e.g. (1,4)), because that will
+ * definitely not match either. We do want to skip to e.g. (2,0). Therefore,
+ * we skip on prefix=1 in this case.
+ *
+ * The provided itup may be NULL. This happens when we don't want to use the
+ * current tuple to update the scankey, but instead want to use the existing
+ * curPos.skipScanKey to fill currentTupleKey. This accounts for some edge
+ * cases.
+ */
+static void
+_bt_skip_update_scankey_for_prefix_skip(IndexScanDesc scan, Relation indexRel,
+ int prefix, IndexTuple itup, ScanDirection prefixDir)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ /* the caller passes either skip->prefix or a smaller value derived from the
+ * comparison result, such that we never skip over more than skip->prefix attributes
+ */
+ int numattrs = prefix;
+
+ if (itup != NULL)
+ {
+ Size itupsz = IndexTupleSize(itup);
+ memcpy(so->skipData->curPos.skipTuple, itup, itupsz);
+
+ _bt_update_scankey_with_tuple(&skip->currentTupleKey, indexRel, (IndexTuple)so->skipData->curPos.skipTuple, numattrs);
+ _bt_copy_scankey(&skip->curPos.skipScanKey, &skip->currentTupleKey, numattrs);
+ }
+ else
+ {
+ skip->curPos.skipScanKey.keysz = numattrs;
+ _bt_copy_scankey(&skip->currentTupleKey, &skip->curPos.skipScanKey, numattrs);
+ }
+ /* update the strategy for the last attribute, as we will use this to
+ * determine the rest of the flags (goback) when doing the actual tree search
+ */
+ skip->currentTupleKey.scankeys[numattrs - 1].sk_strategy =
+ skip->curPos.skipScanKey.scankeys[numattrs - 1].sk_strategy =
+ ScanDirectionIsForward(prefixDir) ? BTGreaterStrategyNumber : BTLessStrategyNumber;
+}
+
+/*
+ * Update the scankey for skipping on the 'extra' conditions: opportunities
+ * that arise when we have just skipped to a new prefix and can try to skip
+ * within the prefix to the right tuple by using extra quals when available.
+ *
+ * @todo as an optimization it should be possible to specialize calls to this
+ * function and to _bt_skip_update_scankey_for_prefix_skip into more specific
+ * functions that need to do less copying of data.
+ */
+void
+_bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel, ScanDirection curDir,
+ ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTSkip skip = so->skipData;
+ BTScanInsert toCopy;
+ int i, left, lastNonTuple = skip->prefix;
+
+ /* first make sure that currentTupleKey is correct at all times */
+ _bt_skip_update_scankey_for_prefix_skip(scan, indexRel, skip->prefix, itup, prefixDir);
+ /* then do the actual work to set up curPos.skipScanKey - distinguish between work that depends on prefixDir
+ * (those attributes between attribute number 1 and 'prefix' inclusive)
+ * and work that depends on curDir
+ * (those attributes between attribute number 'prefix' + 1 and fwdScanKey.keysz inclusive)
+ */
+ if (ScanDirectionIsForward(prefixDir))
+ {
+ /*
+ * If prefixDir is forward, we need to choose between fwdScanKey and
+ * currentTupleKey.  We need to choose the most restrictive one -
+ * in most cases this means choosing eg. a>5 over a=2 when scanning forward,
+ * unless prioritizeEqual is set; that is done for certain special cases.
+ */
+ for (i = 0; i < skip->prefix; i++)
+ {
+ ScanKey scankey = &skip->fwdScanKey.scankeys[i];
+ ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+ if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) > 0
+ || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+ {
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+ lastNonTuple = i;
+ }
+ else
+ {
+ if (lastNonTuple < i)
+ break;
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+ }
+ /* For now choose equality here.  @todo this could be improved a bit by
+ * choosing the strategy from the scankeys, but it doesn't matter much.
+ */
+ skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+ }
+ }
+ else
+ {
+ /* similar for backward but in opposite direction */
+ for (i = 0; i < skip->prefix; i++)
+ {
+ ScanKey scankey = &skip->bwdScanKey.scankeys[i];
+ ScanKey scankeyItem = &skip->currentTupleKey.scankeys[i];
+ if (scankey->sk_attno != 0 && (_bt_compare_one(scankey, scankeyItem->sk_argument, scankeyItem->sk_flags & SK_ISNULL) < 0
+ || (prioritizeEqual && scankey->sk_strategy == BTEqualStrategyNumber)))
+ {
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankey, sizeof(ScanKeyData));
+ lastNonTuple = i;
+ }
+ else
+ {
+ if (lastNonTuple < i)
+ break;
+ memcpy(skip->curPos.skipScanKey.scankeys + i, scankeyItem, sizeof(ScanKeyData));
+ }
+ skip->curPos.skipScanKey.scankeys[i].sk_strategy = BTEqualStrategyNumber;
+ }
+ }
+
+ /*
+ * the remaining keys are the quals after the prefix
+ */
+ if (ScanDirectionIsForward(curDir))
+ toCopy = &skip->fwdScanKey;
+ else
+ toCopy = &skip->bwdScanKey;
+
+ if (lastNonTuple >= skip->prefix - 1)
+ {
+ left = toCopy->keysz - skip->prefix;
+ if (left > 0)
+ {
+ memcpy(skip->curPos.skipScanKey.scankeys + skip->prefix, toCopy->scankeys + i, sizeof(ScanKeyData) * (unsigned long)left);
+ }
+ skip->curPos.skipScanKey.keysz = toCopy->keysz;
+ }
+ else
+ {
+ skip->curPos.skipScanKey.keysz = lastNonTuple + 1;
+ }
+}
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc220146fd..78ceb6aee4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -560,7 +560,7 @@ _bt_leafbuild(BTSpool *btspool, BTSpool *btspool2)
wstate.heap = btspool->heap;
wstate.index = btspool->index;
- wstate.inskey = _bt_mkscankey(wstate.index, NULL);
+ wstate.inskey = _bt_mkscankey(wstate.index, NULL, NULL);
/* _bt_mkscankey() won't set allequalimage without metapage */
wstate.inskey->allequalimage = _bt_allequalimage(wstate.index, true);
wstate.btws_use_wal = RelationNeedsWAL(wstate.index);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index ed67863c56..e91630b050 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -49,10 +49,10 @@ static bool _bt_compare_scankey_args(IndexScanDesc scan, ScanKey op,
ScanKey leftarg, ScanKey rightarg,
bool *result);
static bool _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption);
-static void _bt_mark_scankey_required(ScanKey skey);
+static void _bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag);
static bool _bt_check_rowcompare(ScanKey skey,
IndexTuple tuple, int tupnatts, TupleDesc tupdesc,
- ScanDirection dir, bool *continuescan);
+ ScanDirection dir, bool *continuescan, int *prefixskipindex);
static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
IndexTuple firstright, BTScanInsert itup_key);
@@ -87,9 +87,8 @@ static int _bt_keep_natts(Relation rel, IndexTuple lastleft,
* field themselves.
*/
BTScanInsert
-_bt_mkscankey(Relation rel, IndexTuple itup)
+_bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key)
{
- BTScanInsert key;
ScanKey skey;
TupleDesc itupdesc;
int indnkeyatts;
@@ -109,8 +108,10 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
* Truncated attributes and non-key attributes are omitted from the final
* scan key.
*/
- key = palloc(offsetof(BTScanInsertData, scankeys) +
- sizeof(ScanKeyData) * indnkeyatts);
+ if (key == NULL)
+ key = palloc(offsetof(BTScanInsertData, scankeys) +
+ sizeof(ScanKeyData) * indnkeyatts);
+
if (itup)
_bt_metaversion(rel, &key->heapkeyspace, &key->allequalimage);
else
@@ -155,7 +156,7 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
ScanKeyEntryInitializeWithInfo(&skey[i],
flags,
(AttrNumber) (i + 1),
- InvalidStrategy,
+ BTEqualStrategyNumber,
InvalidOid,
rel->rd_indcollation[i],
procinfo,
@@ -745,7 +746,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int numberOfKeys = scan->numberOfKeys;
int16 *indoption = scan->indexRelation->rd_indoption;
int new_numberOfKeys;
- int numberOfEqualCols;
+ int numberOfEqualCols, numberOfEqualColsSincePrefix;
ScanKey inkeys;
ScanKey outkeys;
ScanKey cur;
@@ -754,6 +755,7 @@ _bt_preprocess_keys(IndexScanDesc scan)
int i,
j;
AttrNumber attno;
+ int prefix = 0;
/* initialize result variables */
so->qual_ok = true;
@@ -762,6 +764,11 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (numberOfKeys < 1)
return; /* done if qual-less scan */
+ if (_bt_skip_enabled(so))
+ {
+ prefix = so->skipData->prefix;
+ }
+
/*
* Read so->arrayKeyData if array keys are present, else scan->keyData
*/
@@ -786,7 +793,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
so->numberOfKeys = 1;
/* We can mark the qual as required if it's for first index col */
if (cur->sk_attno == 1)
- _bt_mark_scankey_required(outkeys);
+ _bt_mark_scankey_required(outkeys, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (cur->sk_attno <= prefix + 1)
+ _bt_mark_scankey_required(outkeys, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
return;
}
@@ -795,6 +804,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
*/
new_numberOfKeys = 0;
numberOfEqualCols = 0;
+ numberOfEqualColsSincePrefix = 0;
+
/*
* Initialize for processing of keys for attr 1.
@@ -830,6 +841,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
if (i == numberOfKeys || cur->sk_attno != attno)
{
int priorNumberOfEqualCols = numberOfEqualCols;
+ int priorNumberOfEqualColsSincePrefix = numberOfEqualColsSincePrefix;
+
/* check input keys are correctly ordered */
if (i < numberOfKeys && cur->sk_attno < attno)
@@ -880,6 +893,8 @@ _bt_preprocess_keys(IndexScanDesc scan)
}
/* track number of attrs for which we have "=" keys */
numberOfEqualCols++;
+ if (attno > prefix)
+ numberOfEqualColsSincePrefix++;
}
/* try to keep only one of <, <= */
@@ -929,7 +944,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
memcpy(outkey, xform[j], sizeof(ScanKeyData));
if (priorNumberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
+ _bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (attno <= prefix || priorNumberOfEqualColsSincePrefix == attno - prefix - 1)
+ _bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
}
}
@@ -954,7 +971,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
memcpy(outkey, cur, sizeof(ScanKeyData));
if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
+ _bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+ _bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
/*
* We don't support RowCompare using equality; such a qual would
@@ -997,7 +1016,9 @@ _bt_preprocess_keys(IndexScanDesc scan)
memcpy(outkey, cur, sizeof(ScanKeyData));
if (numberOfEqualCols == attno - 1)
- _bt_mark_scankey_required(outkey);
+ _bt_mark_scankey_required(outkey, SK_BT_REQFWD, SK_BT_REQBKWD);
+ if (attno <= prefix || numberOfEqualColsSincePrefix == attno - prefix - 1)
+ _bt_mark_scankey_required(outkey, SK_BT_REQSKIPFWD, SK_BT_REQSKIPBKWD);
}
}
}
@@ -1295,7 +1316,7 @@ _bt_fix_scankey_strategy(ScanKey skey, int16 *indoption)
* anyway on a rescan. Something to keep an eye on though.
*/
static void
-_bt_mark_scankey_required(ScanKey skey)
+_bt_mark_scankey_required(ScanKey skey, int forwardReqFlag, int backwardReqFlag)
{
int addflags;
@@ -1303,14 +1324,14 @@ _bt_mark_scankey_required(ScanKey skey)
{
case BTLessStrategyNumber:
case BTLessEqualStrategyNumber:
- addflags = SK_BT_REQFWD;
+ addflags = forwardReqFlag;
break;
case BTEqualStrategyNumber:
- addflags = SK_BT_REQFWD | SK_BT_REQBKWD;
+ addflags = forwardReqFlag | backwardReqFlag;
break;
case BTGreaterEqualStrategyNumber:
case BTGreaterStrategyNumber:
- addflags = SK_BT_REQBKWD;
+ addflags = backwardReqFlag;
break;
default:
elog(ERROR, "unrecognized StrategyNumber: %d",
@@ -1353,17 +1374,22 @@ _bt_mark_scankey_required(ScanKey skey)
*/
bool
_bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
- ScanDirection dir, bool *continuescan)
+ ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
{
TupleDesc tupdesc;
BTScanOpaque so;
int keysz;
int ikey;
ScanKey key;
+ int pfx;
+
+ if (prefixSkipIndex == NULL)
+ prefixSkipIndex = &pfx;
Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
*continuescan = true; /* default assumption */
+ *prefixSkipIndex = -1;
tupdesc = RelationGetDescr(scan->indexRelation);
so = (BTScanOpaque) scan->opaque;
@@ -1392,7 +1418,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
if (key->sk_flags & SK_ROW_HEADER)
{
if (_bt_check_rowcompare(key, tuple, tupnatts, tupdesc, dir,
- continuescan))
+ continuescan, prefixSkipIndex))
continue;
return false;
}
@@ -1429,6 +1455,13 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
ScanDirectionIsBackward(dir))
*continuescan = false;
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1452,6 +1485,10 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
}
else
{
@@ -1468,6 +1505,9 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsForward(dir))
*continuescan = false;
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
}
/*
@@ -1498,6 +1538,13 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
ScanDirectionIsBackward(dir))
*continuescan = false;
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = key->sk_attno - 1;
+
/*
* In any case, this indextuple doesn't match the qual.
*/
@@ -1509,6 +1556,228 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
return true;
}
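+
+/*
+ * _bt_checkkeys_threeway - variant of _bt_checkkeys() that evaluates the
+ * skip scan's insertion-style scan keys using three-way comparison functions
+ * instead of boolean operators.
+ *
+ * Returns true if the tuple satisfies all of those keys.  If a key that is
+ * required for the current scan direction fails, *continuescan is set to
+ * false; if a key that is required for the skip prefix fails, *prefixSkipIndex
+ * is set to the 0-based attribute number at which the caller can skip.
+ */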
+bool
+_bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
+{
+ TupleDesc tupdesc;
+ BTScanOpaque so;
+ int keysz;
+ int ikey;
+ ScanKey key;
+ int pfx;
+ BTScanInsert keys;
+ bool overallmatch = true;
+
+ if (prefixSkipIndex == NULL)
+ prefixSkipIndex = &pfx;
+
+ Assert(BTreeTupleGetNAtts(tuple, scan->indexRelation) == tupnatts);
+
+ *continuescan = true; /* default assumption */
+ *prefixSkipIndex = -1;
+
+ tupdesc = RelationGetDescr(scan->indexRelation);
+ so = (BTScanOpaque) scan->opaque;
+ if (ScanDirectionIsForward(dir))
+ keys = &so->skipData->bwdScanKey;
+ else
+ keys = &so->skipData->fwdScanKey;
+
+ keysz = keys->keysz;
+
+ for (key = keys->scankeys, ikey = 0; ikey < keysz; key++, ikey++)
+ {
+ Datum datum;
+ bool isNull;
+ int cmpresult;
+
+ if (key->sk_attno == 0)
+ continue;
+
+ if (key->sk_attno > tupnatts)
+ {
+ /*
+ * This attribute is truncated (must be high key). The value for
+ * this attribute in the first non-pivot tuple on the page to the
+ * right could be any possible value. Assume that truncated
+ * attribute passes the qual.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ continue;
+ }
+
+ /* row-comparison keys are not expected here */
+ Assert((key->sk_flags & SK_ROW_HEADER) == 0);
+
+ datum = index_getattr(tuple,
+ key->sk_attno,
+ tupdesc,
+ &isNull);
+
+ if (key->sk_flags & SK_ISNULL)
+ {
+ /* Handle IS NULL/NOT NULL tests */
+ if (key->sk_flags & SK_SEARCHNULL)
+ {
+ if (isNull)
+ continue; /* tuple satisfies this qual */
+ }
+ else
+ {
+ Assert(key->sk_flags & SK_SEARCHNOTNULL);
+ if (!isNull)
+ continue; /* tuple satisfies this qual */
+ }
+
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+
+ overallmatch = false;
+ /* don't fall through to the three-way comparison for this key */
+ continue;
+ }
+
+ if (isNull)
+ {
+ if (key->sk_flags & SK_BT_NULLS_FIRST)
+ {
+ /*
+ * Since NULLs are sorted before non-NULLs, we know we have
+ * reached the lower limit of the range of values for this
+ * index attr. On a backward scan, we can stop if this qual
+ * is one of the "must match" subset. We can stop regardless
+ * of whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a forward scan, however, we must keep going, because we may
+ * have initially positioned to the start of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ }
+ else
+ {
+ /*
+ * Since NULLs are sorted after non-NULLs, we know we have
+ * reached the upper limit of the range of values for this
+ * index attr. On a forward scan, we can stop if this qual is
+ * one of the "must match" subset. We can stop regardless of
+ * whether the qual is > or <, so long as it's required,
+ * because it's not possible for any future tuples to pass. On
+ * a backward scan, however, we must keep going, because we
+ * may have initially positioned to the end of the index.
+ */
+ if ((key->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
+ ScanDirectionIsForward(dir))
+ {
+ *continuescan = false;
+ return false;
+ }
+ if ((key->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ }
+
+ overallmatch = false;
+ /* the index attribute is NULL; skip the comparison function call below */
+ continue;
+ }
+
+ /* Perform the test --- three-way comparison not bool operator */
+ cmpresult = DatumGetInt32(FunctionCall2Coll(&key->sk_func,
+ key->sk_collation,
+ datum,
+ key->sk_argument));
+ if (key->sk_flags & SK_BT_DESC)
+ INVERT_COMPARE_RESULT(cmpresult);
+
+ if (cmpresult != 0)
+ {
+ /*
+ * Tuple fails this qual. If it's a required qual for the current
+ * scan direction, then we can conclude no further tuples will
+ * pass, either.
+ *
+ * Note: because we stop the scan as soon as any required equality
+ * qual fails, it is critical that equality quals be used for the
+ * initial positioning in _bt_first() when they are available. See
+ * comments in _bt_first().
+ */
+ if ((key->sk_flags & SK_BT_REQFWD) &&
+ ScanDirectionIsForward(dir) && cmpresult > 0)
+ {
+ *continuescan = false;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQBKWD) &&
+ ScanDirectionIsBackward(dir) && cmpresult < 0)
+ {
+ *continuescan = false;
+ return false;
+ }
+
+ if ((key->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir) && cmpresult > 0)
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+ else if ((key->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir) && cmpresult < 0)
+ {
+ *prefixSkipIndex = key->sk_attno - 1;
+ return false;
+ }
+
+ /*
+ * In any case, this indextuple doesn't match the qual.
+ */
+ overallmatch = false;
+ }
+ }
+
+ /* If we get here, the tuple passes all index quals. */
+ return overallmatch;
+}
+
/*
* Test whether an indextuple satisfies a row-comparison scan condition.
*
@@ -1520,7 +1789,7 @@ _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
*/
static bool
_bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
- TupleDesc tupdesc, ScanDirection dir, bool *continuescan)
+ TupleDesc tupdesc, ScanDirection dir, bool *continuescan, int *prefixSkipIndex)
{
ScanKey subkey = (ScanKey) DatumGetPointer(skey->sk_argument);
int32 cmpresult = 0;
@@ -1576,6 +1845,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
}
else
{
@@ -1592,6 +1865,10 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
if ((subkey->sk_flags & (SK_BT_REQFWD | SK_BT_REQBKWD)) &&
ScanDirectionIsForward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & (SK_BT_REQSKIPFWD | SK_BT_REQSKIPBKWD)) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
}
/*
@@ -1616,6 +1893,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
+ else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
return false;
}
@@ -1678,6 +1962,13 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, int tupnatts,
else if ((subkey->sk_flags & SK_BT_REQBKWD) &&
ScanDirectionIsBackward(dir))
*continuescan = false;
+
+ if ((subkey->sk_flags & SK_BT_REQSKIPFWD) &&
+ ScanDirectionIsForward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
+ else if ((subkey->sk_flags & SK_BT_REQSKIPBKWD) &&
+ ScanDirectionIsBackward(dir))
+ *prefixSkipIndex = subkey->sk_attno - 1;
}
return result;
@@ -2733,3 +3024,524 @@ _bt_allequalimage(Relation rel, bool debugmessage)
return allequalimage;
}
+
+void
+_bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir, bool *nextkey, bool *goback)
+{
+ /*----------
+ * Examine the selected initial-positioning strategy to determine exactly
+ * where we need to start the scan, and set flag variables to control the
+ * code below.
+ *
+ * If nextkey = false, _bt_search and _bt_binsrch will locate the first
+ * item >= scan key. If nextkey = true, they will locate the first
+ * item > scan key.
+ *
+ * If goback = true, we will then step back one item, while if
+ * goback = false, we will start the scan on the located item.
+ *----------
+ */
+ switch (stratTotal)
+ {
+ case BTLessStrategyNumber:
+
+ /*
+ * Find first item >= scankey, then back up one to arrive at last
+ * item < scankey. (Note: this positioning strategy is only used
+ * for a backward scan, so that is always the correct starting
+ * position.)
+ */
+ *nextkey = false;
+ *goback = true;
+ break;
+
+ case BTLessEqualStrategyNumber:
+
+ /*
+ * Find first item > scankey, then back up one to arrive at last
+ * item <= scankey. (Note: this positioning strategy is only used
+ * for a backward scan, so that is always the correct starting
+ * position.)
+ */
+ *nextkey = true;
+ *goback = true;
+ break;
+
+ case BTEqualStrategyNumber:
+
+ /*
+ * If a backward scan was specified, need to start with last equal
+ * item not first one.
+ */
+ if (ScanDirectionIsBackward(dir))
+ {
+ /*
+ * This is the same as the <= strategy. We will check at the
+ * end whether the found item is actually =.
+ */
+ *nextkey = true;
+ *goback = true;
+ }
+ else
+ {
+ /*
+ * This is the same as the >= strategy. We will check at the
+ * end whether the found item is actually =.
+ */
+ *nextkey = false;
+ *goback = false;
+ }
+ break;
+
+ case BTGreaterEqualStrategyNumber:
+
+ /*
+ * Find first item >= scankey. (This is only used for forward
+ * scans.)
+ */
+ *nextkey = false;
+ *goback = false;
+ break;
+
+ case BTGreaterStrategyNumber:
+
+ /*
+ * Find first item > scankey. (This is only used for forward
+ * scans.)
+ */
+ *nextkey = true;
+ *goback = false;
+ break;
+
+ default:
+ /* can't get here, but keep compiler quiet */
+ elog(ERROR, "unrecognized strat_total: %d", (int) stratTotal);
+ }
+}
+
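+/*
+ * Build an insertion scan key in *inskey from the boundary keys chosen into
+ * startKeys[] (see _bt_choose_scan_keys).  NULL startKeys[] entries (prefix
+ * attributes with no boundary key) are marked with sk_attno = 0.  Returns
+ * false if the qual is unsatisfiable (NULL in the first row-comparison
+ * member); otherwise returns true and sets *goback and inskey->nextkey via
+ * _bt_set_bsearch_flags(), possibly adjusting *stratTotal for row keys.
+ */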
+bool
+_bt_create_insertion_scan_key(Relation rel, ScanDirection dir, ScanKey *startKeys, int keysCount,
+ BTScanInsert inskey, StrategyNumber *stratTotal, bool *goback)
+{
+ int i;
+ bool nextkey;
+
+ /*
+ * We want to start the scan somewhere within the index. Set up an
+ * insertion scankey we can use to search for the boundary point we
+ * identified above. The insertion scankey is built using the keys
+ * identified by startKeys[]. (Remaining insertion scankey fields are
+ * initialized after initial-positioning strategy is finalized.)
+ */
+ Assert(keysCount <= INDEX_MAX_KEYS);
+ for (i = 0; i < keysCount; i++)
+ {
+ ScanKey cur = startKeys[i];
+
+ if (cur == NULL)
+ {
+ inskey->scankeys[i].sk_attno = 0;
+ continue;
+ }
+
+ Assert(cur->sk_attno == i + 1);
+
+ if (cur->sk_flags & SK_ROW_HEADER)
+ {
+ /*
+ * Row comparison header: look to the first row member instead.
+ *
+ * The member scankeys are already in insertion format (ie, they
+ * have sk_func = 3-way-comparison function), but we have to watch
+ * out for nulls, which _bt_preprocess_keys didn't check. A null
+ * in the first row member makes the condition unmatchable, just
+ * like qual_ok = false.
+ */
+ ScanKey subkey = (ScanKey) DatumGetPointer(cur->sk_argument);
+
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_flags & SK_ISNULL)
+ {
+ return false;
+ }
+ memcpy(inskey->scankeys + i, subkey, sizeof(ScanKeyData));
+
+ /*
+ * If the row comparison is the last positioning key we accepted,
+ * try to add additional keys from the lower-order row members.
+ * (If we accepted independent conditions on additional index
+ * columns, we use those instead --- doesn't seem worth trying to
+ * determine which is more restrictive.) Note that this is OK
+ * even if the row comparison is of ">" or "<" type, because the
+ * condition applied to all but the last row member is effectively
+ * ">=" or "<=", and so the extra keys don't break the positioning
+ * scheme. But, by the same token, if we aren't able to use all
+ * the row members, then the part of the row comparison that we
+ * did use has to be treated as just a ">=" or "<=" condition, and
+ * so we'd better adjust strat_total accordingly.
+ */
+ if (i == keysCount - 1)
+ {
+ bool used_all_subkeys = false;
+
+ Assert(!(subkey->sk_flags & SK_ROW_END));
+ for (;;)
+ {
+ subkey++;
+ Assert(subkey->sk_flags & SK_ROW_MEMBER);
+ if (subkey->sk_attno != keysCount + 1)
+ break; /* out-of-sequence, can't use it */
+ if (subkey->sk_strategy != cur->sk_strategy)
+ break; /* wrong direction, can't use it */
+ if (subkey->sk_flags & SK_ISNULL)
+ break; /* can't use null keys */
+ Assert(keysCount < INDEX_MAX_KEYS);
+ memcpy(inskey->scankeys + keysCount, subkey,
+ sizeof(ScanKeyData));
+ keysCount++;
+ if (subkey->sk_flags & SK_ROW_END)
+ {
+ used_all_subkeys = true;
+ break;
+ }
+ }
+ if (!used_all_subkeys)
+ {
+ switch (*stratTotal)
+ {
+ case BTLessStrategyNumber:
+ *stratTotal = BTLessEqualStrategyNumber;
+ break;
+ case BTGreaterStrategyNumber:
+ *stratTotal = BTGreaterEqualStrategyNumber;
+ break;
+ }
+ }
+ break; /* done with outer loop */
+ }
+ }
+ else
+ {
+ /*
+ * Ordinary comparison key. Transform the search-style scan key
+ * to an insertion scan key by replacing the sk_func with the
+ * appropriate btree comparison function.
+ *
+ * If scankey operator is not a cross-type comparison, we can use
+ * the cached comparison function; otherwise gotta look it up in
+ * the catalogs. (That can't lead to infinite recursion, since no
+ * indexscan initiated by syscache lookup will use cross-data-type
+ * operators.)
+ *
+ * We support the convention that sk_subtype == InvalidOid means
+ * the opclass input type; this is a hack to simplify life for
+ * ScanKeyInit().
+ */
+ if (cur->sk_subtype == rel->rd_opcintype[i] ||
+ cur->sk_subtype == InvalidOid)
+ {
+ FmgrInfo *procinfo;
+
+ procinfo = index_getprocinfo(rel, cur->sk_attno, BTORDER_PROC);
+ ScanKeyEntryInitializeWithInfo(inskey->scankeys + i,
+ cur->sk_flags,
+ cur->sk_attno,
+ cur->sk_strategy,
+ cur->sk_subtype,
+ cur->sk_collation,
+ procinfo,
+ cur->sk_argument);
+ }
+ else
+ {
+ RegProcedure cmp_proc;
+
+ cmp_proc = get_opfamily_proc(rel->rd_opfamily[i],
+ rel->rd_opcintype[i],
+ cur->sk_subtype,
+ BTORDER_PROC);
+ if (!RegProcedureIsValid(cmp_proc))
+ elog(ERROR, "missing support function %d(%u,%u) for attribute %d of index \"%s\"",
+ BTORDER_PROC, rel->rd_opcintype[i], cur->sk_subtype,
+ cur->sk_attno, RelationGetRelationName(rel));
+ ScanKeyEntryInitialize(inskey->scankeys + i,
+ cur->sk_flags,
+ cur->sk_attno,
+ cur->sk_strategy,
+ cur->sk_subtype,
+ cur->sk_collation,
+ cmp_proc,
+ cur->sk_argument);
+ }
+ }
+ }
+
+ _bt_set_bsearch_flags(*stratTotal, dir, &nextkey, goback);
+
+ /* Initialize remaining insertion scan key fields */
+ _bt_metaversion(rel, &inskey->heapkeyspace, &inskey->allequalimage);
+ inskey->anynullkeys = false; /* unused */
+ inskey->nextkey = nextkey;
+ inskey->pivotsearch = false;
+ inskey->scantid = NULL;
+ inskey->keysz = keysCount;
+
+ return true;
+}
+
+/*----------
+ * Examine the scan keys to discover where we need to start the scan.
+ *
+ * We want to identify the keys that can be used as starting boundaries;
+ * these are =, >, or >= keys for a forward scan or =, <, <= keys for
+ * a backwards scan. We can use keys for multiple attributes so long as
+ * the prior attributes had only =, >= (resp. =, <=) keys. Once we accept
+ * a > or < boundary or find an attribute with no boundary (which can be
+ * thought of as the same as "> -infinity"), we can't use keys for any
+ * attributes to its right, because it would break our simplistic notion
+ * of what initial positioning strategy to use.
+ *
+ * When the scan keys include cross-type operators, _bt_preprocess_keys
+ * may not be able to eliminate redundant keys; in such cases we will
+ * arbitrarily pick a usable one for each attribute. This is correct
+ * but possibly not optimal behavior. (For example, with keys like
+ * "x >= 4 AND x >= 5" we would elect to scan starting at x=4 when
+ * x=5 would be more efficient.) Since the situation only arises given
+ * a poorly-worded query plus an incomplete opfamily, live with it.
+ *
+ * When both equality and inequality keys appear for a single attribute
+ * (again, only possible when cross-type operators appear), we *must*
+ * select one of the equality keys for the starting point, because
+ * _bt_checkkeys() will stop the scan as soon as an equality qual fails.
+ * For example, if we have keys like "x >= 4 AND x = 10" and we elect to
+ * start at x=4, we will fail and stop before reaching x=10. If multiple
+ * equality quals survive preprocessing, however, it doesn't matter which
+ * one we use --- by definition, they are either redundant or
+ * contradictory.
+ *
+ * Any regular (not SK_SEARCHNULL) key implies a NOT NULL qualifier.
+ * If the index stores nulls at the end of the index we'll be starting
+ * from, and we have no boundary key for the column (which means the key
+ * we deduced NOT NULL from is an inequality key that constrains the other
+ * end of the index), then we cons up an explicit SK_SEARCHNOTNULL key to
+ * use as a boundary key. If we didn't do this, we might find ourselves
+ * traversing a lot of null entries at the start of the scan.
+ *
+ * In this loop, row-comparison keys are treated the same as keys on their
+ * first (leftmost) columns. We'll add on lower-order columns of the row
+ * comparison below, if possible.
+ *
+ * The selected scan keys (at most one per index column) are remembered by
+ * storing their addresses into the local startKeys[] array.
+ *----------
+ */
+int
+_bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir, ScanKey *startKeys,
+ ScanKeyData *notnullkeys, StrategyNumber *stratTotal, int prefix)
+{
+ StrategyNumber strat;
+ int keysCount = 0;
+ int i;
+
+ *stratTotal = BTEqualStrategyNumber;
+ if (numberOfKeys > 0 || prefix > 0)
+ {
+ AttrNumber curattr;
+ ScanKey chosen;
+ ScanKey impliesNN;
+ ScanKey cur;
+
+ /*
+ * chosen is the so-far-chosen key for the current attribute, if any.
+ * We don't cast the decision in stone until we reach keys for the
+ * next attribute.
+ */
+ curattr = 1;
+ chosen = NULL;
+ /* Also remember any scankey that implies a NOT NULL constraint */
+ impliesNN = NULL;
+
+ /*
+ * Loop iterates from 0 to numberOfKeys inclusive; we use the last
+ * pass to handle after-last-key processing. Actual exit from the
+ * loop is at one of the "break" statements below.
+ */
+ for (cur = scanKeys, i = 0;; cur++, i++)
+ {
+ if (i >= numberOfKeys || cur->sk_attno != curattr)
+ {
+ /*
+ * Done looking at keys for curattr. If we didn't find a
+ * usable boundary key, see if we can deduce a NOT NULL key.
+ */
+ if (chosen == NULL && impliesNN != NULL &&
+ ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+ ScanDirectionIsForward(dir) :
+ ScanDirectionIsBackward(dir)))
+ {
+ /* Yes, so build the key in notnullkeys[keysCount] */
+ chosen = ¬nullkeys[keysCount];
+ ScanKeyEntryInitialize(chosen,
+ (SK_SEARCHNOTNULL | SK_ISNULL |
+ (impliesNN->sk_flags &
+ (SK_BT_DESC | SK_BT_NULLS_FIRST))),
+ curattr,
+ ((impliesNN->sk_flags & SK_BT_NULLS_FIRST) ?
+ BTGreaterStrategyNumber :
+ BTLessStrategyNumber),
+ InvalidOid,
+ InvalidOid,
+ InvalidOid,
+ (Datum) 0);
+ }
+
+ /*
+ * If we still didn't find a usable boundary key, quit; else
+ * save the boundary key pointer in startKeys.
+ */
+ if (chosen == NULL && curattr > prefix)
+ break;
+ startKeys[keysCount++] = chosen;
+
+ /*
+ * Adjust strat_total, and quit if we have stored a > or <
+ * key.
+ */
+ if (chosen != NULL && curattr > prefix)
+ {
+ strat = chosen->sk_strategy;
+ if (strat != BTEqualStrategyNumber)
+ {
+ *stratTotal = strat;
+ if (strat == BTGreaterStrategyNumber ||
+ strat == BTLessStrategyNumber)
+ break;
+ }
+ }
+
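+ /*
+ * For a skip scan, prefix attributes without a boundary key still get
+ * a placeholder (NULL) entry in startKeys[]; _bt_create_insertion_scan_key
+ * marks those entries with sk_attno = 0 so that the skip code can later
+ * fill them in from the current tuple.
+ */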
+ /*
+ * Done if that was the last attribute, or if next key is not
+ * in sequence (implying no boundary key is available for the
+ * next attribute).
+ */
+ if (i >= numberOfKeys)
+ {
+ curattr++;
+ while(curattr <= prefix)
+ {
+ startKeys[keysCount++] = NULL;
+ curattr++;
+ }
+ break;
+ }
+ else if (cur->sk_attno != curattr + 1)
+ {
+ curattr++;
+ while(curattr < cur->sk_attno && curattr <= prefix)
+ {
+ startKeys[keysCount++] = NULL;
+ curattr++;
+ }
+ if (curattr > prefix && curattr != cur->sk_attno)
+ break;
+ }
+ else
+ {
+ curattr++;
+ }
+
+ /*
+ * Reset for next attr.
+ */
+ chosen = NULL;
+ impliesNN = NULL;
+ }
+
+ /*
+ * Can we use this key as a starting boundary for this attr?
+ *
+ * If not, does it imply a NOT NULL constraint? (Because
+ * SK_SEARCHNULL keys are always assigned BTEqualStrategyNumber,
+ * *any* inequality key works for that; we need not test.)
+ */
+ switch (cur->sk_strategy)
+ {
+ case BTLessStrategyNumber:
+ case BTLessEqualStrategyNumber:
+ if (chosen == NULL)
+ {
+ if (ScanDirectionIsBackward(dir))
+ chosen = cur;
+ else
+ impliesNN = cur;
+ }
+ break;
+ case BTEqualStrategyNumber:
+ /* override any non-equality choice */
+ chosen = cur;
+ break;
+ case BTGreaterEqualStrategyNumber:
+ case BTGreaterStrategyNumber:
+ if (chosen == NULL)
+ {
+ if (ScanDirectionIsForward(dir))
+ chosen = cur;
+ else
+ impliesNN = cur;
+ }
+ break;
+ }
+ }
+ }
+ return keysCount;
+}
+
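+/*
+ * print_itup - debugging aid that emits a DEBUG1 message describing the key
+ * values of one index tuple, or of a pair of pivot tuples describing a key
+ * range, on block blk.  Catalog indexes are skipped to avoid infinite
+ * recursion.
+ */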
+void
+print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel, char *extra)
+{
+ bool isnull[INDEX_MAX_KEYS];
+ Datum values[INDEX_MAX_KEYS];
+ char *lkey_desc = NULL;
+ char *rkey_desc;
+
+ /* Avoid infinite recursion -- don't instrument catalog indexes */
+ if (!IsCatalogRelation(rel))
+ {
+ TupleDesc itupdesc = RelationGetDescr(rel);
+ int natts;
+ int indnkeyatts = rel->rd_index->indnkeyatts;
+
+ natts = BTreeTupleGetNAtts(left, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(left, itupdesc, values, isnull);
+ rel->rd_index->indnkeyatts = natts;
+
+ /*
+ * Since the regression tests should pass when the instrumentation
+ * patch is applied, be prepared for BuildIndexValueDescription() to
+ * return NULL due to security considerations.
+ */
+ lkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ if (lkey_desc && right)
+ {
+ /*
+ * Revolting hack: modify tuple descriptor to have number of key
+ * columns actually present in caller's pivot tuples
+ */
+ natts = BTreeTupleGetNAtts(right, rel);
+ itupdesc->natts = Min(indnkeyatts, natts);
+ memset(&isnull, 0xFF, sizeof(isnull));
+ index_deform_tuple(right, itupdesc, values, isnull);
+ rel->rd_index->indnkeyatts = natts;
+ rkey_desc = BuildIndexValueDescription(rel, values, isnull);
+ elog(DEBUG1, "%s blk %u sk > %s, sk <= %s %s",
+ RelationGetRelationName(rel), blk, lkey_desc, rkey_desc,
+ extra);
+ pfree(rkey_desc);
+ }
+ else
+ elog(DEBUG1, "%s blk %u sk check %s %s",
+ RelationGetRelationName(rel), blk, lkey_desc, extra);
+
+ /* Cleanup */
+ itupdesc->natts = IndexRelationGetNumberOfAttributes(rel);
+ rel->rd_index->indnkeyatts = indnkeyatts;
+ if (lkey_desc)
+ pfree(lkey_desc);
+ }
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 1ae7492216..a0f10ebbdc 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -73,6 +73,9 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
+ amroutine->ambeginskipscan = NULL;
+ amroutine->amgetskiptuple = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b970997c34..f2ce1893f9 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -152,6 +152,7 @@ static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainIndentText(ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1114,6 +1115,22 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+ if (skipPrefixSize > 0)
+ {
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1461,6 +1478,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexScan *indexscan = (IndexScan *) plan;
+ if (indexscan->indexdistinct)
+ ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexscan->indexid,
indexscan->indexorderdir,
es);
@@ -1471,6 +1491,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ if (indexonlyscan->indexdistinct)
+ ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1731,6 +1754,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyText("Skip scan", ((IndexScan *) plan)->indexdistinct ? "Distinct only" : "All", es);
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
@@ -1744,6 +1769,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyText("Skip scan", ((IndexOnlyScan *) plan)->indexdistinct ? "Distinct only" : "All", es);
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->recheckqual)
@@ -1760,6 +1787,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate->instrument->ntuples2, 0, es);
break;
case T_BitmapIndexScan:
+ if (((BitmapIndexScan *) plan)->indexskipprefixsize > 0)
+ ExplainPropertyText("Skip scan", "All", es);
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
break;
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 043bb83f55..fb56109e89 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -133,6 +133,14 @@ ExecScanFetch(ScanState *node,
return (*accessMtd) (node);
}
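+/*
+ * ExecScan - compatibility wrapper around ExecScanExtended() for callers
+ * that do not supply a skip method.
+ */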
+TupleTableSlot *
+ExecScan(ScanState *node,
+ ExecScanAccessMtd accessMtd, /* function returning a tuple */
+ ExecScanRecheckMtd recheckMtd)
+{
+ return ExecScanExtended(node, accessMtd, recheckMtd, NULL);
+}
+
/* ----------------------------------------------------------------
* ExecScan
*
@@ -155,9 +163,10 @@ ExecScanFetch(ScanState *node,
* ----------------------------------------------------------------
*/
TupleTableSlot *
-ExecScan(ScanState *node,
+ExecScanExtended(ScanState *node,
ExecScanAccessMtd accessMtd, /* function returning a tuple */
- ExecScanRecheckMtd recheckMtd)
+ ExecScanRecheckMtd recheckMtd,
+ ExecScanSkipMtd skipMtd)
{
ExprContext *econtext;
ExprState *qual;
@@ -170,6 +179,20 @@ ExecScan(ScanState *node,
projInfo = node->ps.ps_ProjInfo;
econtext = node->ps.ps_ExprContext;
+ if (skipMtd != NULL && node->ss_FirstTupleEmitted)
+ {
+ bool cont = skipMtd(node);
+ if (!cont)
+ {
+ node->ss_FirstTupleEmitted = false;
+ return ExecClearTuple(node->ss_ScanTupleSlot);
+ }
+ }
+ else
+ {
+ node->ss_FirstTupleEmitted = true;
+ }
+
/* interrupt checks are in ExecScanFetch */
/*
@@ -178,8 +201,13 @@ ExecScan(ScanState *node,
*/
if (!qual && !projInfo)
{
+ TupleTableSlot *slot;
+
ResetExprContext(econtext);
- return ExecScanFetch(node, accessMtd, recheckMtd);
+ slot = ExecScanFetch(node, accessMtd, recheckMtd);
+ if (TupIsNull(slot))
+ node->ss_FirstTupleEmitted = false;
+ return slot;
}
/*
@@ -206,6 +234,7 @@ ExecScan(ScanState *node,
*/
if (TupIsNull(slot))
{
+ node->ss_FirstTupleEmitted = false;
if (projInfo)
return ExecClearTuple(projInfo->pi_state.resultslot);
else
@@ -306,6 +335,8 @@ ExecScanReScan(ScanState *node)
*/
ExecClearTuple(node->ss_ScanTupleSlot);
+ node->ss_FirstTupleEmitted = false;
+
/* Rescan EvalPlanQual tuple if we're inside an EvalPlanQual recheck */
if (estate->es_epq_active != NULL)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 551e47630d..5c0401253d 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -22,13 +22,14 @@
#include "postgres.h"
#include "access/genam.h"
+#include "access/relscan.h"
#include "executor/execdebug.h"
#include "executor/nodeBitmapIndexscan.h"
#include "executor/nodeIndexscan.h"
#include "miscadmin.h"
+#include "utils/rel.h"
#include "utils/memutils.h"
-
/* ----------------------------------------------------------------
* ExecBitmapIndexScan
*
@@ -223,6 +224,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecBitmapIndexScan;
+ indexstate->ss.ss_FirstTupleEmitted = false;
/* normally we don't make the result bitmap till runtime */
indexstate->biss_result = NULL;
@@ -308,10 +310,20 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/*
* Initialize scan descriptor.
*/
- indexstate->biss_ScanDesc =
- index_beginscan_bitmap(indexstate->biss_RelationDesc,
- estate->es_snapshot,
- indexstate->biss_NumScanKeys);
+ if (node->indexskipprefixsize > 0)
+ {
+ indexstate->biss_ScanDesc =
+ index_beginscan_bitmap_skip(indexstate->biss_RelationDesc,
+ estate->es_snapshot,
+ indexstate->biss_NumScanKeys,
+ Min(IndexRelationGetNumberOfKeyAttributes(indexstate->biss_RelationDesc),
+ node->indexskipprefixsize));
+ }
+ else
+ indexstate->biss_ScanDesc =
+ index_beginscan_bitmap(indexstate->biss_RelationDesc,
+ estate->es_snapshot,
+ indexstate->biss_NumScanKeys);
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index eb3ddd2943..f3ea4d4417 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/predicate.h"
+#include "storage/itemptr.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -49,6 +50,37 @@ static TupleTableSlot *IndexOnlyNext(IndexOnlyScanState *node);
static void StoreIndexTuple(TupleTableSlot *slot, IndexTuple itup,
TupleDesc itupdesc);
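+
+/*
+ * IndexOnlySkip - ask the index AM (via index_skip()) to advance to the next
+ * distinct prefix for a distinct-only index-only skip scan.  Returns false
+ * when no further prefix is found, which ends the scan; returns true
+ * otherwise, including when this node is not a distinct-only skip scan.
+ */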
+static bool
+IndexOnlySkip(IndexOnlyScanState *node)
+{
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ if (!node->ioss_Distinct)
+ return true;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ /* flip direction if this is an overall backward scan */
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->ioss_ScanDesc;
+
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir))
+ return false;
+
+ return true;
+}
/* ----------------------------------------------------------------
* IndexOnlyNext
@@ -65,6 +97,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ ItemPointerData startTid;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -72,7 +106,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -90,11 +124,19 @@ IndexOnlyNext(IndexOnlyScanState *node)
* serially executing an index only scan that was planned to be
* parallel.
*/
- scandesc = index_beginscan(node->ss.ss_currentRelation,
- node->ioss_RelationDesc,
- estate->es_snapshot,
- node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ if (node->ioss_SkipPrefixSize > 0)
+ scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+ node->ioss_RelationDesc,
+ estate->es_snapshot,
+ node->ioss_NumScanKeys,
+ node->ioss_NumOrderByKeys,
+ Min(IndexRelationGetNumberOfKeyAttributes(node->ioss_RelationDesc), node->ioss_SkipPrefixSize));
+ else
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->ioss_RelationDesc,
+ estate->es_snapshot,
+ node->ioss_NumScanKeys,
+ node->ioss_NumOrderByKeys);
node->ioss_ScanDesc = scandesc;
@@ -114,11 +156,16 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_OrderByKeys,
node->ioss_NumOrderByKeys);
}
+ else
+ {
+ ItemPointerCopy(&scandesc->xs_heaptid, &startTid);
+ }
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while ((tid = node->ioss_SkipPrefixSize > 0 ? index_getnext_tid_skip(scandesc, direction, node->ioss_Distinct ? indexonlyscan->indexorderdir : direction) :
+ index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
@@ -312,9 +359,10 @@ ExecIndexOnlyScan(PlanState *pstate)
if (node->ioss_NumRuntimeKeys != 0 && !node->ioss_RuntimeKeysReady)
ExecReScan((PlanState *) node);
- return ExecScan(&node->ss,
+ return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) IndexOnlyNext,
- (ExecScanRecheckMtd) IndexOnlyRecheck);
+ (ExecScanRecheckMtd) IndexOnlyRecheck,
+ node->ioss_Distinct ? (ExecScanSkipMtd) IndexOnlySkip : NULL);
}
/* ----------------------------------------------------------------
@@ -502,6 +550,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ss.ss_FirstTupleEmitted = false;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_Distinct = node->indexdistinct;
/*
* Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a91f135be7..349c356584 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -69,6 +69,37 @@ static void reorderqueue_push(IndexScanState *node, TupleTableSlot *slot,
Datum *orderbyvals, bool *orderbynulls);
static HeapTuple reorderqueue_pop(IndexScanState *node);
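+
+/*
+ * IndexSkip - same as IndexOnlySkip(), but for plain index scans: ask the
+ * index AM (via index_skip()) to advance to the next distinct prefix.
+ * Returns false when no further prefix is found, which ends the scan.
+ */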
+static bool
+IndexSkip(IndexScanState *node)
+{
+ EState *estate;
+ ScanDirection direction;
+ IndexScanDesc scandesc;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ if (!node->iss_Distinct)
+ return true;
+
+ /*
+ * extract necessary information from index scan node
+ */
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ /* flip direction if this is an overall backward scan */
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
+ {
+ if (ScanDirectionIsForward(direction))
+ direction = BackwardScanDirection;
+ else if (ScanDirectionIsBackward(direction))
+ direction = ForwardScanDirection;
+ }
+ scandesc = node->iss_ScanDesc;
+
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir))
+ return false;
+
+ return true;
+}
/* ----------------------------------------------------------------
* IndexNext
@@ -85,6 +116,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
/*
* extract necessary information from index scan node
@@ -92,7 +124,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -109,14 +141,25 @@ IndexNext(IndexScanState *node)
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
*/
- scandesc = index_beginscan(node->ss.ss_currentRelation,
- node->iss_RelationDesc,
- estate->es_snapshot,
- node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ if (node->iss_SkipPrefixSize > 0)
+ scandesc = index_beginscan_skip(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys,
+ Min(IndexRelationGetNumberOfKeyAttributes(node->iss_RelationDesc), node->iss_SkipPrefixSize));
+ else
+ scandesc = index_beginscan(node->ss.ss_currentRelation,
+ node->iss_RelationDesc,
+ estate->es_snapshot,
+ node->iss_NumScanKeys,
+ node->iss_NumOrderByKeys);
node->iss_ScanDesc = scandesc;
+ /* Index skip scan assumes xs_want_itup, so set it to true if we skip over distinct */
+ node->iss_ScanDesc->xs_want_itup = indexscan->indexdistinct;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -130,7 +173,9 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (node->iss_SkipPrefixSize > 0 ?
+ index_getnext_slot_skip(scandesc, direction, node->iss_Distinct ? indexscan->indexorderdir : direction, slot) :
+ index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
@@ -530,13 +575,15 @@ ExecIndexScan(PlanState *pstate)
ExecReScan((PlanState *) node);
if (node->iss_NumOrderByKeys > 0)
- return ExecScan(&node->ss,
+ return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) IndexNextWithReorder,
- (ExecScanRecheckMtd) IndexRecheck);
+ (ExecScanRecheckMtd) IndexRecheck,
+ node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
else
- return ExecScan(&node->ss,
+ return ExecScanExtended(&node->ss,
(ExecScanAccessMtd) IndexNext,
- (ExecScanRecheckMtd) IndexRecheck);
+ (ExecScanRecheckMtd) IndexRecheck,
+ node->iss_Distinct ? (ExecScanSkipMtd) IndexSkip : NULL);
}
/* ----------------------------------------------------------------
@@ -910,6 +957,9 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->ss.ss_FirstTupleEmitted = false;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_Distinct = node->indexdistinct;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index fa927a3044..f3d0054a38 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -497,6 +497,8 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
+ COPY_SCALAR_FIELD(indexdistinct);
return newnode;
}
@@ -523,6 +525,8 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
+ COPY_SCALAR_FIELD(indexdistinct);
return newnode;
}
@@ -547,6 +551,7 @@ _copyBitmapIndexScan(const BitmapIndexScan *from)
COPY_SCALAR_FIELD(isshared);
COPY_NODE_FIELD(indexqual);
COPY_NODE_FIELD(indexqualorig);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 2369d26c8c..8718fffd26 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -569,6 +569,8 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
+ WRITE_INT_FIELD(indexdistinct);
}
static void
@@ -584,6 +586,9 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
+ WRITE_INT_FIELD(indexdistinct);
+
}
static void
@@ -597,6 +602,7 @@ _outBitmapIndexScan(StringInfo str, const BitmapIndexScan *node)
WRITE_BOOL_FIELD(isshared);
WRITE_NODE_FIELD(indexqual);
WRITE_NODE_FIELD(indexqualorig);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 7b1a2a397c..a0580bf306 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1876,6 +1876,8 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
+ READ_INT_FIELD(indexdistinct);
READ_DONE();
}
@@ -1896,6 +1898,8 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
+ READ_INT_FIELD(indexdistinct);
READ_DONE();
}
@@ -1914,6 +1918,7 @@ _readBitmapIndexScan(void)
READ_BOOL_FIELD(isshared);
READ_NODE_FIELD(indexqual);
READ_NODE_FIELD(indexqualorig);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ac95015b56..610528f600 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1756,7 +1756,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
List *startup_subpaths = NIL;
List *total_subpaths = NIL;
List *fractional_subpaths = NIL;
+ List *uniq_total_subpaths = NIL;
bool startup_neq_total = false;
+ bool uniq_neq_total = false;
ListCell *lcr;
bool match_partition_order;
bool match_partition_order_desc;
@@ -1786,7 +1788,8 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
RelOptInfo *childrel = (RelOptInfo *) lfirst(lcr);
Path *cheapest_startup,
*cheapest_total,
- *cheapest_fractional = NULL;
+ *cheapest_fractional = NULL,
+ *cheapest_uniq_total = NULL;
/* Locate the right paths, if they are available. */
cheapest_startup =
@@ -1802,6 +1805,19 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
TOTAL_COST,
false);
+ cheapest_uniq_total =
+ get_cheapest_path_for_pathkeys(childrel->unique_pathlist,
+ pathkeys,
+ NULL,
+ TOTAL_COST,
+ false);
+
+ if (cheapest_uniq_total != NULL && !uniq_neq_total)
+ {
+ uniq_neq_total = true;
+ uniq_total_subpaths = list_copy(total_subpaths);
+ }
+
/*
* If we can't find any paths with the right order just use the
* cheapest-total path; we'll have to sort it later.
@@ -1814,6 +1830,9 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
Assert(cheapest_total->param_info == NULL);
}
+ if (cheapest_uniq_total == NULL)
+ cheapest_uniq_total = cheapest_total;
+
/*
* When building a fractional path, determine a cheapest fractional
* path for each child relation too. Looking at startup and total
@@ -1877,6 +1896,11 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
cheapest_fractional = get_singleton_append_subpath(cheapest_fractional);
fractional_subpaths = lappend(fractional_subpaths, cheapest_fractional);
}
+ if (uniq_neq_total)
+ {
+ cheapest_uniq_total = get_singleton_append_subpath(cheapest_uniq_total);
+ uniq_total_subpaths = lappend(uniq_total_subpaths, cheapest_uniq_total);
+ }
}
else if (match_partition_order_desc)
{
@@ -1896,6 +1920,11 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
cheapest_fractional = get_singleton_append_subpath(cheapest_fractional);
fractional_subpaths = lcons(cheapest_fractional, fractional_subpaths);
}
+ if (uniq_neq_total)
+ {
+ cheapest_uniq_total = get_singleton_append_subpath(cheapest_uniq_total);
+ uniq_total_subpaths = lcons(cheapest_uniq_total, uniq_total_subpaths);
+ }
}
else
{
@@ -1911,6 +1940,10 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
if (cheapest_fractional)
accumulate_append_subpath(cheapest_fractional,
&fractional_subpaths, NULL);
+
+ if (uniq_neq_total)
+ accumulate_append_subpath(cheapest_uniq_total,
+ &uniq_total_subpaths, NULL);
}
}
@@ -1948,6 +1981,17 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
0,
false,
-1));
+
+ if (uniq_neq_total)
+ add_unique_path(rel, (Path *) create_append_path(root,
+ rel,
+ uniq_total_subpaths,
+ NIL,
+ pathkeys,
+ NULL,
+ 0,
+ false,
+ -1));
}
else
{
@@ -1970,6 +2014,12 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
fractional_subpaths,
pathkeys,
NULL));
+ if (uniq_neq_total)
+ add_unique_path(rel, (Path *) create_merge_append_path(root,
+ rel,
+ uniq_total_subpaths,
+ pathkeys,
+ NULL));
}
}
}
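
If I read the generate_orderedappend_paths changes correctly, the point is to let a DISTINCT over a partitioned table stitch per-partition unique (skip scan) paths together under an ordered Append/MergeAppend. A sketch of the kind of query that should benefit (the table is hypothetical, and whether the unique append path actually wins is still up to the cost comparison):

  CREATE TABLE pt (a int, b int) PARTITION BY RANGE (a);
  CREATE TABLE pt1 PARTITION OF pt FOR VALUES FROM (0) TO (100);
  CREATE TABLE pt2 PARTITION OF pt FOR VALUES FROM (100) TO (200);
  CREATE INDEX ON pt (a, b);
  SELECT DISTINCT a FROM pt;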
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 8dc7dd4ca2..efb2954338 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -133,6 +133,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 0ef70ad7f1..7c8889a350 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -784,6 +784,16 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
IndexPath *ipath = (IndexPath *) lfirst(lc);
+ /*
+ * To prevent unique paths from index skip scans from being used when
+ * they are not needed, keep them in a separate pathlist.
+ */
+ if (ipath->indexdistinct)
+ {
+ add_unique_path(rel, (Path *) ipath);
+ continue;
+ }
+
if (index->amhasgettuple)
add_path(rel, (Path *) ipath);
@@ -872,6 +882,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
+ bool can_skip;
int indexcol;
/*
@@ -1021,6 +1032,9 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan = (scantype != ST_BITMAPSCAN &&
check_index_only(rel, index));
+ /* Check if an index skip scan is possible. */
+ can_skip = enable_indexskipscan && index->amcanskip;
+
/*
* 4. Generate an indexscan path if there are relevant restriction clauses
* in the current clauses, OR the index ordering is potentially useful for
@@ -1044,6 +1058,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
false);
result = lappend(result, ipath);
+ /* Consider index skip scan as well */
+ if (root->query_uniquekeys != NULL && can_skip)
+ {
+ int numusefulkeys = list_length(useful_pathkeys);
+ int numsortkeys = list_length(root->query_pathkeys);
+
+ if (numusefulkeys == numsortkeys)
+ {
+ int prefix;
+ if (list_length(root->distinct_pathkeys) > 0)
+ prefix = find_index_prefix_for_pathkey(root,
+ index,
+ ForwardScanDirection,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
+ else
+ /* All distinct keys are constant and have been optimized away;
+ * skipping with a prefix of 1 is sufficient since they are all constant.
+ */
+ prefix = 1;
+
+ result = lappend(result,
+ create_skipscan_unique_path(root, index,
+ (Path *) ipath, prefix));
+ }
+ }
+
/*
* If appropriate, consider parallel index scan. We don't allow
* parallel index scan for bitmap index scans.
@@ -1099,6 +1140,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
false);
result = lappend(result, ipath);
+ /* Consider index skip scan as well */
+ if (root->query_uniquekeys != NULL && can_skip)
+ {
+ int numusefulkeys = list_length(useful_pathkeys);
+ int numsortkeys = list_length(root->query_pathkeys);
+
+ if (numusefulkeys == numsortkeys)
+ {
+ int prefix;
+ if (list_length(root->distinct_pathkeys) > 0)
+ prefix = find_index_prefix_for_pathkey(root,
+ index,
+ BackwardScanDirection,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
+ else
+ /* All distinct keys are constant and have been optimized away;
+ * skipping with a prefix of 1 is sufficient since they are all constant.
+ */
+ prefix = 1;
+
+ result = lappend(result,
+ create_skipscan_unique_path(root, index,
+ (Path *) ipath, prefix));
+ }
+ }
+
/* If appropriate, consider parallel index scan */
if (index->amcanparallel &&
rel->consider_parallel && outer_relids == NULL &&
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 9b7cdce350..cc90e15952 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -522,6 +522,78 @@ get_cheapest_parallel_safe_total_inner(List *paths)
* NEW PATHKEY FORMATION
****************************************************************************/
+/*
+ * Find the prefix size for a specific path key in an index.
+ * For example, an index with (a,b,c) finding path key b will
+ * return prefix 2.
+ * Returns 0 when not found.
+ */
+int
+find_index_prefix_for_pathkey(PlannerInfo *root,
+ IndexOptInfo *index,
+ ScanDirection scandir,
+ PathKey *pathkey)
+{
+ ListCell *lc;
+ int i;
+
+ i = 0;
+ foreach(lc, index->indextlist)
+ {
+ TargetEntry *indextle = (TargetEntry *) lfirst(lc);
+ Expr *indexkey;
+ bool reverse_sort;
+ bool nulls_first;
+ PathKey *cpathkey;
+
+ /*
+ * INCLUDE columns are stored in index unordered, so they don't
+ * support ordered index scan.
+ */
+ if (i >= index->nkeycolumns)
+ break;
+
+ /* We assume we don't need to make a copy of the tlist item */
+ indexkey = indextle->expr;
+
+ if (ScanDirectionIsBackward(scandir))
+ {
+ reverse_sort = !index->reverse_sort[i];
+ nulls_first = !index->nulls_first[i];
+ }
+ else
+ {
+ reverse_sort = index->reverse_sort[i];
+ nulls_first = index->nulls_first[i];
+ }
+
+ /*
+ * OK, try to make a canonical pathkey for this sort key. Note we're
+ * underneath any outer joins, so nullable_relids should be NULL.
+ */
+ cpathkey = make_pathkey_from_sortinfo(root,
+ indexkey,
+ NULL,
+ index->sortopfamily[i],
+ index->opcintype[i],
+ index->indexcollations[i],
+ reverse_sort,
+ nulls_first,
+ 0,
+ index->rel->relids,
+ false);
+
+ if (cpathkey == pathkey)
+ {
+ return i + 1;
+ }
+
+ i++;
+ }
+
+ return 0;
+}
+
/*
* build_index_pathkeys
* Build a pathkeys list that describes the ordering induced by an index
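
To make the prefix computation above concrete (a sketch only, using a hypothetical table; the mapping from DISTINCT keys to a prefix follows the function's own (a,b,c) example and the llast_node(distinct_pathkeys) call in build_index_paths):

  CREATE TABLE t (a int, b int, c int);
  CREATE INDEX ON t (a, b, c);
  -- last DISTINCT pathkey is a  => skip prefix 1
  SELECT DISTINCT ON (a) a, b FROM t ORDER BY a, b;
  -- last DISTINCT pathkey is b  => skip prefix 2
  SELECT DISTINCT ON (a, b) a, b, c FROM t ORDER BY a, b;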
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index cd6d72c763..9df4d95ebf 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -185,16 +185,21 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix,
+ bool distinct);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *recheckqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix,
+ bool distinct);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
- List *indexqualorig);
+ List *indexqualorig,
+ int skipPrefixSize);
static BitmapHeapScan *make_bitmap_heapscan(List *qptlist,
List *qpqual,
Plan *lefttree,
@@ -3092,7 +3097,9 @@ create_indexscan_plan(PlannerInfo *root,
stripped_indexquals,
fixed_indexorderbys,
indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix,
+ best_path->indexdistinct);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -3103,7 +3110,9 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix,
+ best_path->indexdistinct);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -3393,7 +3402,8 @@ create_bitmap_subplan(PlannerInfo *root, Path *bitmapqual,
plan = (Plan *) make_bitmap_indexscan(iscan->scan.scanrelid,
iscan->indexid,
iscan->indexqual,
- iscan->indexqualorig);
+ iscan->indexqualorig,
+ iscan->indexskipprefixsize);
/* and set its cost/width fields appropriately */
plan->startup_cost = 0.0;
plan->total_cost = ipath->indextotalcost;
@@ -5436,7 +5446,9 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize,
+ bool distinct)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5453,6 +5465,8 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
+ node->indexdistinct = distinct;
return node;
}
@@ -5466,7 +5480,9 @@ make_indexonlyscan(List *qptlist,
List *recheckqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize,
+ bool distinct)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5482,6 +5498,8 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
+ node->indexdistinct = distinct;
return node;
}
@@ -5490,7 +5508,8 @@ static BitmapIndexScan *
make_bitmap_indexscan(Index scanrelid,
Oid indexid,
List *indexqual,
- List *indexqualorig)
+ List *indexqualorig,
+ int skipPrefixSize)
{
BitmapIndexScan *node = makeNode(BitmapIndexScan);
Plan *plan = &node->scan.plan;
@@ -5503,6 +5522,7 @@ make_bitmap_indexscan(Index scanrelid,
node->indexid = indexid;
node->indexqual = indexqual;
node->indexqualorig = indexqualorig;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5ae2475400..6a7ac49d6e 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3109,12 +3109,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
root->distinct_pathkeys =
make_pathkeys_for_sortclauses(root,
parse->distinctClause,
tlist);
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->query_uniquekeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4441,7 +4447,7 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
RelOptInfo *distinct_rel)
{
Query *parse = root->parse;
- Path *cheapest_input_path = input_rel->cheapest_total_path;
+ Path *cheapest_input_path = input_rel->cheapest_distinct_unique_path;
double numDistinctRows;
bool allow_hash;
Path *path;
@@ -4514,8 +4520,14 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = (Path *) lfirst(lc);
- if (query_has_uniquekeys_for(root, needed_pathkeys, false))
+ if (query_has_uniquekeys_for(root, path->uniquekeys, false))
add_path(distinct_rel, path);
+ else if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
+ add_path(distinct_rel, (Path *)
+ create_upper_unique_path(root, distinct_rel,
+ path,
+ list_length(root->distinct_pathkeys),
+ numDistinctRows));
}
/* For explicit-sort case, always use the more rigorous clause */
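
The intent of the create_final_distinct_paths change, as far as I can tell, is that a path whose uniquekeys already satisfy the query needs no extra node on top, while a merely well-sorted path still gets an explicit Unique node. The expected output added by this patch shows the first case (copied from select_distinct.out further down):

  EXPLAIN (COSTS OFF)
  SELECT DISTINCT a FROM distinct_a;
  --  Index Only Scan using distinct_a_a_b_idx on distinct_a
  --    Skip scan: Distinct only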
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index abb77d867e..ac321bf31d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -245,6 +245,7 @@ set_cheapest(RelOptInfo *parent_rel)
{
Path *cheapest_startup_path;
Path *cheapest_total_path;
+ Path *cheapest_distinct_unique_path;
Path *best_param_path;
List *parameterized_paths;
ListCell *p;
@@ -256,6 +257,7 @@ set_cheapest(RelOptInfo *parent_rel)
cheapest_startup_path = cheapest_total_path = best_param_path = NULL;
parameterized_paths = NIL;
+ cheapest_distinct_unique_path = NULL;
foreach(p, parent_rel->pathlist)
{
@@ -354,6 +356,36 @@ set_cheapest(RelOptInfo *parent_rel)
cheapest_total_path = best_param_path;
Assert(cheapest_total_path != NULL);
+ cheapest_distinct_unique_path = cheapest_total_path;
+
+ foreach(p, parent_rel->unique_pathlist)
+ {
+ Path *path = (Path *) lfirst(p);
+ int cmp;
+
+ /* Unparameterized path, so consider it for cheapest slots */
+ if (cheapest_distinct_unique_path == NULL)
+ {
+ cheapest_distinct_unique_path = path;
+ continue;
+ }
+
+ /*
+ * If we find two paths of identical costs, try to keep the
+ * better-sorted one. The paths might have unrelated sort
+ * orderings, in which case we can only guess which might be
+ * better to keep, but if one is superior then we definitely
+ * should keep that one.
+ */
+ cmp = compare_path_costs(cheapest_distinct_unique_path, path, TOTAL_COST);
+ if (cmp > 0 ||
+ (cmp == 0 &&
+ compare_pathkeys(cheapest_distinct_unique_path->pathkeys,
+ path->pathkeys) == PATHKEYS_BETTER2))
+ cheapest_distinct_unique_path = path;
+ }
+
+ parent_rel->cheapest_distinct_unique_path = cheapest_distinct_unique_path;
parent_rel->cheapest_startup_path = cheapest_startup_path;
parent_rel->cheapest_total_path = cheapest_total_path;
parent_rel->cheapest_unique_path = NULL; /* computed only if needed */
@@ -1293,6 +1325,10 @@ create_append_path(PlannerInfo *root,
pathnode->path.parallel_safe = rel->consider_parallel;
pathnode->path.parallel_workers = parallel_workers;
pathnode->path.pathkeys = pathkeys;
+ if (list_length(subpaths) == 1)
+ {
+ pathnode->path.uniquekeys = ((Path*)linitial(subpaths))->uniquekeys;
+ }
/*
* For parallel append, non-partial paths are sorted by descending total
@@ -1437,6 +1473,10 @@ create_merge_append_path(PlannerInfo *root,
pathnode->path.parallel_workers = 0;
pathnode->path.pathkeys = pathkeys;
pathnode->subpaths = subpaths;
+ if (list_length(subpaths) == 1)
+ {
+ pathnode->path.uniquekeys = ((Path*)linitial(subpaths))->uniquekeys;
+ }
/*
* Apply query-wide LIMIT if known and path is for sole base relation.
@@ -3095,6 +3135,44 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root, IndexOptInfo *index,
+ Path *basepath, int prefix)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+ double numDistinctRows;
+ UniqueKey *ukey;
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ ukey = linitial_node(UniqueKey, root->query_uniquekeys);
+
+ Assert(prefix > 0);
+ pathnode->indexskipprefix = prefix;
+ pathnode->indexdistinct = true;
+ pathnode->path.uniquekeys = root->query_uniquekeys;
+
+ numDistinctRows = estimate_num_groups(root, ukey->exprs,
+ pathnode->path.rows,
+ NULL, NULL);
+
+ pathnode->path.total_cost = pathnode->path.startup_cost * numDistinctRows;
+ pathnode->path.rows = numDistinctRows;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 0fdb8e9ada..368d2c3f27 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,9 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL &&
+ amroutine->amgetskiptuple != NULL &&
+ amroutine->ambeginskipscan != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6fc5cbc09a..99b42063b1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1001,6 +1001,15 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a1acd46b61..8cdf3d0ee7 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -367,6 +367,7 @@
#enable_incremental_sort = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_memoize = on
#enable_mergejoin = on
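
The new GUC defaults to on, mirroring the other enable_* planner settings. A quick way to compare plans with and without the new paths (a sketch only, reusing the distinct_a table from the regression test added below):

  SET enable_indexskipscan = off;
  EXPLAIN (COSTS OFF) SELECT DISTINCT a FROM distinct_a;  -- skip scan paths should no longer be generated
  RESET enable_indexskipscan;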
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index a3f22d7357..2756c1587b 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1009,7 +1009,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
if (state->indexInfo->ii_Expressions != NULL)
{
@@ -1104,7 +1104,7 @@ tuplesort_begin_index_btree(Relation heapRel,
state->indexRel = indexRel;
state->enforceUnique = enforceUnique;
- indexScanKey = _bt_mkscankey(indexRel, NULL);
+ indexScanKey = _bt_mkscankey(indexRel, NULL, NULL);
/* Prepare SortSupport data for each column */
state->sortKeys = (SortSupport) palloc0(state->nKeys *
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index a382551a98..cbcc373e01 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -162,6 +162,12 @@ typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
int nkeys,
int norderbys);
+/* prepare for index scan with skip */
+typedef IndexScanDesc (*ambeginscan_skip_function) (Relation indexRelation,
+ int nkeys,
+ int norderbys,
+ int prefix);
+
/* (re)start index scan */
typedef void (*amrescan_function) (IndexScanDesc scan,
ScanKey keys,
@@ -173,6 +179,16 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* next valid tuple */
+typedef bool (*amgettuple_with_skip_function) (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+
+/* skip past duplicates */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -271,12 +287,15 @@ typedef struct IndexAmRoutine
amvalidate_function amvalidate;
amadjustmembers_function amadjustmembers; /* can be NULL */
ambeginscan_function ambeginscan;
+ ambeginscan_skip_function ambeginskipscan;
amrescan_function amrescan;
amgettuple_function amgettuple; /* can be NULL */
+ amgettuple_with_skip_function amgetskiptuple; /* can be NULL */
amgetbitmap_function amgetbitmap; /* can be NULL */
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 134b20f1e6..93db9139f8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,9 +152,17 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
int nkeys, int norderbys);
+extern IndexScanDesc index_beginscan_skip(Relation heapRelation,
+ Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys, int norderbys, int prefix);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
+extern IndexScanDesc index_beginscan_bitmap_skip(Relation indexRelation,
+ Snapshot snapshot,
+ int nkeys,
+ int prefix);
extern void index_rescan(IndexScanDesc scan,
ScanKey keys, int nkeys,
ScanKey orderbys, int norderbys);
@@ -170,10 +178,16 @@ extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
ParallelIndexScanDesc pscan);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
+extern ItemPointer index_getnext_tid_skip(IndexScanDesc scan,
+ ScanDirection prefixDir,
+ ScanDirection postfixDir);
struct TupleTableSlot;
extern bool index_fetch_heap(IndexScanDesc scan, struct TupleTableSlot *slot);
extern bool index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
struct TupleTableSlot *slot);
+extern bool index_getnext_slot_skip(IndexScanDesc scan, ScanDirection prefixDir,
+ ScanDirection postfixDir,
+ struct TupleTableSlot *slot);
extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
@@ -183,6 +197,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *istat);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection prefixDir,
+ ScanDirection postfixDir);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9fec6fb1a8..d34ac4031f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1027,6 +1027,54 @@ typedef struct BTArrayKeyInfo
Datum *elem_values; /* array of num_elems Datums */
} BTArrayKeyInfo;
+typedef struct BTSkipCompareResult
+{
+ bool equal;
+ int prefixCmpResult, skCmpResult;
+ bool prefixSkip, fullKeySkip;
+ int prefixSkipIndex;
+} BTSkipCompareResult;
+
+typedef enum BTSkipState
+{
+ SkipStateStop,
+ SkipStateSkip,
+ SkipStateSkipExtra,
+ SkipStateNext
+} BTSkipState;
+
+typedef struct BTSkipPosData
+{
+ BTSkipState nextAction;
+ ScanDirection nextDirection;
+ int nextSkipIndex;
+ BTScanInsertData skipScanKey;
+ char skipTuple[BLCKSZ]; /* tuple data that the skipScanKey Datums point to */
+} BTSkipPosData;
+
+typedef struct BTSkipData
+{
+ /* used to control skipping
+ * curPos.skipScanKey is a combination of currentTupleKey and fwdScanKey/bwdScanKey.
+ * currentTupleKey contains the scan keys for the current tuple
+ * fwdScanKey contains the scan keys for quals that would be chosen for a forward scan
+ * bwdScanKey contains the scan keys for quals that would be chosen for a backward scan
+ * we need both fwd and bwd, because the scan keys differ for going fwd and bwd
+ * if a qual would be a>2 and a<5, fwd would have a>2, while bwd would have a<5
+ */
+ BTScanInsertData currentTupleKey;
+ BTScanInsertData fwdScanKey;
+ ScanKeyData fwdNotNullKeys[INDEX_MAX_KEYS];
+ BTScanInsertData bwdScanKey;
+ ScanKeyData bwdNotNullKeys[INDEX_MAX_KEYS];
+ /* length of prefix to skip */
+ int prefix;
+
+ BTSkipPosData curPos, markPos;
+} BTSkipData;
+
+typedef BTSkipData *BTSkip;
+
typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
@@ -1064,6 +1112,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTSkip skipData; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -1078,6 +1129,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
*/
#define SK_BT_REQFWD 0x00010000 /* required to continue forward scan */
#define SK_BT_REQBKWD 0x00020000 /* required to continue backward scan */
+#define SK_BT_REQSKIPFWD 0x00040000 /* required to continue forward scan within current prefix */
+#define SK_BT_REQSKIPBKWD 0x00080000 /* required to continue backward scan within current prefix */
#define SK_BT_INDOPTION_SHIFT 24 /* must clear the above bits */
#define SK_BT_DESC (INDOPTION_DESC << SK_BT_INDOPTION_SHIFT)
#define SK_BT_NULLS_FIRST (INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
@@ -1124,9 +1177,12 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
bool indexUnchanged,
struct IndexInfo *indexInfo);
extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan_skip(Relation rel, int nkeys, int norderbys, int skipPrefix);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
+extern bool btgettuple_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool btskip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
@@ -1227,15 +1283,81 @@ extern Buffer _bt_moveright(Relation rel, BTScanInsert key, Buffer buf,
bool forupdate, BTStack stack, int access, Snapshot snapshot);
extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
-extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_first(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_next(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
+extern Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
+extern OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+extern void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
+extern bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
+ OffsetNumber *offnum, bool isRegularMode);
+extern bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+extern void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+
+/*
+ * prototypes for functions in nbtskip.c
+ */
+static inline bool
+_bt_skip_enabled(BTScanOpaque so)
+{
+ return so->skipData != NULL;
+}
+
+static inline bool
+_bt_skip_is_regular_mode(ScanDirection prefixDir, ScanDirection postfixDir)
+{
+ return prefixDir == postfixDir;
+}
+
+/* returns whether or not we can use extra quals in the scankey after skipping to a prefix */
+static inline bool
+_bt_has_extra_quals_after_skip(BTSkip skip, ScanDirection dir, int prefix)
+{
+ if (ScanDirectionIsForward(dir))
+ {
+ return skip->fwdScanKey.keysz > prefix;
+ }
+ else
+ {
+ return skip->bwdScanKey.keysz > prefix;
+ }
+}
+
+/* alias of BTScanPosIsValid */
+static inline bool
+_bt_skip_is_always_valid(BTScanOpaque so)
+{
+ return BTScanPosIsValid(so->currPos);
+}
+
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_create_scankeys(Relation rel, BTScanOpaque so);
+extern void _bt_skip_update_scankey_for_extra_skip(IndexScanDesc scan, Relation indexRel,
+ ScanDirection curDir, ScanDirection prefixDir, bool prioritizeEqual, IndexTuple itup);
+extern void _bt_skip_once(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ bool forceSkip, ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_extra_conditions(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir, BTSkipCompareResult *cmp);
+extern bool _bt_skip_find_next(IndexScanDesc scan, IndexTuple curTuple, OffsetNumber curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir);
+extern void _bt_skip_until_match(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum,
+ ScanDirection prefixDir, ScanDirection postfixDir);
+extern bool _bt_has_results(BTScanOpaque so);
+extern void _bt_compare_current_item(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool isRegularMode, BTSkipCompareResult* cmp);
+extern bool _bt_step_back_page(IndexScanDesc scan, IndexTuple *curTuple, OffsetNumber *curTupleOffnum);
+extern bool _bt_step_forward_page(IndexScanDesc scan, BlockNumber next, IndexTuple *curTuple,
+ OffsetNumber *curTupleOffnum);
+extern bool _bt_checkkeys_skip(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixskipindex);
+extern IndexTuple
+_bt_get_tuple_from_offset(BTScanOpaque so, OffsetNumber curTupleOffnum);
/*
* prototypes for functions in nbtutils.c
*/
-extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup);
extern void _bt_freestack(BTStack stack);
extern void _bt_preprocess_array_keys(IndexScanDesc scan);
extern void _bt_start_array_keys(IndexScanDesc scan, ScanDirection dir);
@@ -1244,7 +1366,7 @@ extern void _bt_mark_array_keys(IndexScanDesc scan);
extern void _bt_restore_array_keys(IndexScanDesc scan);
extern void _bt_preprocess_keys(IndexScanDesc scan);
extern bool _bt_checkkeys(IndexScanDesc scan, IndexTuple tuple,
- int tupnatts, ScanDirection dir, bool *continuescan);
+ int tupnatts, ScanDirection dir, bool *continuescan, int *indexSkipPrefix);
extern void _bt_killitems(IndexScanDesc scan);
extern BTCycleId _bt_vacuum_cycleid(Relation rel);
extern BTCycleId _bt_start_vacuum(Relation rel);
@@ -1266,6 +1388,19 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
extern void _bt_check_third_page(Relation rel, Relation heap,
bool needheaptidspace, Page page, IndexTuple newtup);
extern bool _bt_allequalimage(Relation rel, bool debugmessage);
+extern bool _bt_checkkeys_threeway(IndexScanDesc scan, IndexTuple tuple, int tupnatts,
+ ScanDirection dir, bool *continuescan, int *prefixSkipIndex);
+extern bool _bt_create_insertion_scan_key(Relation rel, ScanDirection dir,
+ ScanKey* startKeys, int keysCount,
+ BTScanInsert inskey, StrategyNumber* stratTotal,
+ bool* goback);
+extern void _bt_set_bsearch_flags(StrategyNumber stratTotal, ScanDirection dir,
+ bool* nextkey, bool* goback);
+extern int _bt_choose_scan_keys(ScanKey scanKeys, int numberOfKeys, ScanDirection dir,
+ ScanKey* startKeys, ScanKeyData* notnullkeys,
+ StrategyNumber* stratTotal, int prefix);
+extern BTScanInsert _bt_mkscankey(Relation rel, IndexTuple itup, BTScanInsert key);
+extern void print_itup(BlockNumber blk, IndexTuple left, IndexTuple right, Relation rel, char *extra);
/*
* prototypes for functions in nbtvalidate.c
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 344399f6a8..9bf3922d58 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -455,9 +455,13 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanSkipMtd) (ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+extern TupleTableSlot *ExecScanExtended(ScanState *node, ExecScanAccessMtd accessMtd,
+ ExecScanRecheckMtd recheckMtd,
+ ExecScanSkipMtd skipMtd);
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4ea8735dd8..1aa67c5ae6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1383,6 +1383,7 @@ typedef struct ScanState
Relation ss_currentRelation;
struct TableScanDescData *ss_currentScanDesc;
TupleTableSlot *ss_ScanTupleSlot;
+ bool ss_FirstTupleEmitted;
} ScanState;
/* ----------------
@@ -1479,6 +1480,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_Distinct;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1508,6 +1511,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
@@ -1526,6 +1531,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_Distinct;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 1de5095e74..22704fa5f0 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -699,6 +699,7 @@ typedef struct RelOptInfo
List *unique_pathlist; /* unique Paths */
struct Path *cheapest_startup_path;
struct Path *cheapest_total_path;
+ struct Path *cheapest_distinct_unique_path;
struct Path *cheapest_unique_path;
List *cheapest_parameterized_paths;
@@ -1267,6 +1268,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1279,6 +1283,8 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
+ bool indexdistinct;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0b518ce6b2..817b0be1fa 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -411,6 +411,8 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for skip scans */
+ bool indexdistinct; /* whether only distinct keys are requested */
} IndexScan;
/* ----------------
@@ -453,6 +455,8 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for skip scans */
+ bool indexdistinct; /* whether only distinct keys are requested */
} IndexOnlyScan;
/* ----------------
@@ -479,6 +483,7 @@ typedef struct BitmapIndexScan
bool isshared; /* Create shared bitmap if set */
List *indexqual; /* list of index quals (OpExprs) */
List *indexqualorig; /* the same in original form */
+ int indexskipprefixsize; /* the size of the prefix for skip scans */
} BitmapIndexScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 356a51f370..03d5816c82 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index bb6d730e93..227cda4bd7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -218,6 +218,10 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ Path *subpath,
+ int prefix);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 16bb5e0eea..36e98b660c 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -212,6 +212,10 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
Relids required_outer,
double fraction);
extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
+extern int find_index_prefix_for_pathkey(PlannerInfo *root,
+ IndexOptInfo *index,
+ ScanDirection scandir,
+ PathKey *pathkey);
extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
ScanDirection scandir);
extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
diff --git a/src/interfaces/libpq/encnames.c b/src/interfaces/libpq/encnames.c
new file mode 120000
index 0000000000..ca78618b55
--- /dev/null
+++ b/src/interfaces/libpq/encnames.c
@@ -0,0 +1 @@
+../../../src/backend/utils/mb/encnames.c
\ No newline at end of file
diff --git a/src/interfaces/libpq/wchar.c b/src/interfaces/libpq/wchar.c
new file mode 120000
index 0000000000..a27508f72a
--- /dev/null
+++ b/src/interfaces/libpq/wchar.c
@@ -0,0 +1 @@
+../../../src/backend/utils/mb/wchar.c
\ No newline at end of file
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 58122c6f88..ec98dbf63b 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -375,3 +375,602 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+ QUERY PLAN
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+(2 rows)
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (b = 2)
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (b = 2)
+(3 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: Distinct only
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: Distinct only
+ Index Cond: (c = 2)
+(3 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+(2 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (a = 1)
+(3 rows)
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: Distinct only
+ Index Cond: (b = 2)
+ Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column?
+---+----------
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan: Distinct only
+ Index Cond: ((b >= 1) AND (c = 0))
+(3 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
+-- test tuple killing
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(4 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- regular ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a, b;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+---+---+----
+ 1 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 5 | 1 | 1 | 10
+(4 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+---+---+----
+ 5 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 1 | 1 | 1 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- partial delete
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(5 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(5 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 2088857615..cd564b0316 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -110,6 +110,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_incremental_sort | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_memoize | on
enable_mergejoin | on
@@ -122,7 +123,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(20 rows)
+(21 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index 1bfe59c26f..708aa2a746 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -174,3 +174,251 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- check columns order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+-- check columns order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
+
+-- test tuple killing
+
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- regular ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a, b;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- partial delete
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
--
2.33.1
v3-0003-Support-skip-scan-for-non-distinct-scans.patch
From d4360c041b625d32c1b37d9d9d26e2e03b3d4ad4 Mon Sep 17 00:00:00 2001
From: Floris van Nee <floris.vannee@gmail.com>
Date: Thu, 19 Mar 2020 10:27:47 +0100
Subject: [PATCH 5/5] Support skip scan for non-distinct scans
Adds planner support to choose a skip scan for regular
non-distinct queries like:
SELECT * FROM t1 WHERE b=1 (with index on (a,b))
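To make the target case concrete, here is a rough sketch of the kind of query
this aims at (table, index, and data are made up for illustration and are not
part of the patch):

CREATE TABLE t1 (a int, b int);
CREATE INDEX ON t1 (a, b);
INSERT INTO t1 SELECT i % 10, i FROM generate_series(1, 100000) i;
ANALYZE t1;

-- b is constrained but the leading column a is not; the planner may now
-- consider skipping to each distinct value of a and probing for b = 1
-- within it, instead of scanning the whole index
EXPLAIN (COSTS OFF) SELECT * FROM t1 WHERE b = 1;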
---
src/backend/optimizer/path/indxpath.c | 181 +++++++++++++++++++++++++-
src/backend/optimizer/plan/planner.c | 2 +-
src/backend/optimizer/util/pathnode.c | 4 +-
src/backend/utils/adt/selfuncs.c | 153 ++++++++++++++++++++--
src/include/optimizer/pathnode.h | 3 +-
5 files changed, 327 insertions(+), 16 deletions(-)
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 7c8889a350..5538e01c28 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -192,6 +192,17 @@ static Expr *match_clause_to_ordering_op(IndexOptInfo *index,
static bool ec_member_matches_indexcol(PlannerInfo *root, RelOptInfo *rel,
EquivalenceClass *ec, EquivalenceMember *em,
void *arg);
+static List* add_possible_index_skip_paths(List* result,
+ PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *indexorderbys,
+ List *indexorderbycols,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count);
/*
@@ -820,6 +831,136 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
}
+/*
+ * Find available index skip paths and add them to the path list
+ */
+static List* add_possible_index_skip_paths(List* result,
+ PlannerInfo *root,
+ IndexOptInfo *index,
+ List *indexclauses,
+ List *indexorderbys,
+ List *indexorderbycols,
+ List *pathkeys,
+ ScanDirection indexscandir,
+ bool indexonly,
+ Relids required_outer,
+ double loop_count)
+{
+ int indexcol;
+ bool eqQualHere;
+ bool eqQualPrev;
+ bool eqSoFar;
+ ListCell *lc;
+
+ /*
+ * We need to find possible prefixes to use for the skip scan
+ * Any useful prefix is one just before an index clause, unless
+ * all clauses so far have been equal.
+ * For example, on an index (a,b,c), the qual b=1 would
+ * mean that an interesting skip prefix could be 1.
+ * For qual a=1 AND b=1, it is not interesting to skip with
+ * prefix 1, because the value of a is fixed already.
+ */
+ indexcol = 0;
+ eqQualHere = false;
+ eqQualPrev = false;
+ eqSoFar = true;
+ foreach(lc, indexclauses)
+ {
+ IndexClause *iclause = lfirst_node(IndexClause, lc);
+ ListCell *lc2;
+
+ if (indexcol != iclause->indexcol)
+ {
+ if (!eqQualHere || indexcol != iclause->indexcol - 1)
+ eqSoFar = false;
+
+ /* Beginning of a new column's quals */
+ if (!eqQualPrev && !eqSoFar)
+ {
+ /* We have a qual on current column,
+ * there is no equality qual on the previous column,
+ * not all of the previous quals are equality so far
+ * (last one is special case for the first column in the index).
+ * Optimal conditions to try an index skip path.
+ */
+ IndexPath *ipath = create_index_path(root, index,
+ indexclauses,
+ indexorderbys,
+ indexorderbycols,
+ pathkeys,
+ indexscandir,
+ indexonly,
+ required_outer,
+ loop_count,
+ false,
+ iclause->indexcol);
+ result = lappend(result, ipath);
+ }
+
+ eqQualPrev = eqQualHere;
+ eqQualHere = false;
+ indexcol++;
+ /* if the clause is not for this index col, increment until it is */
+ while (indexcol != iclause->indexcol)
+ {
+ eqQualPrev = false;
+ eqSoFar = false;
+ indexcol++;
+ }
+ }
+
+ /* Examine each indexqual associated with this index clause */
+ foreach(lc2, iclause->indexquals)
+ {
+ RestrictInfo *rinfo = lfirst_node(RestrictInfo, lc2);
+ Expr *clause = rinfo->clause;
+ Oid clause_op = InvalidOid;
+ int op_strategy;
+
+ if (IsA(clause, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) clause;
+ clause_op = op->opno;
+ }
+ else if (IsA(clause, RowCompareExpr))
+ {
+ RowCompareExpr *rc = (RowCompareExpr *) clause;
+ clause_op = linitial_oid(rc->opnos);
+ }
+ else if (IsA(clause, ScalarArrayOpExpr))
+ {
+ ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) clause;
+ clause_op = saop->opno;
+ }
+ else if (IsA(clause, NullTest))
+ {
+ NullTest *nt = (NullTest *) clause;
+
+ if (nt->nulltesttype == IS_NULL)
+ {
+ /* IS NULL is like = for selectivity purposes */
+ eqQualHere = true;
+ }
+ }
+ else
+ elog(ERROR, "unsupported indexqual type: %d",
+ (int) nodeTag(clause));
+
+ /* check for equality operator */
+ if (OidIsValid(clause_op))
+ {
+ op_strategy = get_op_opfamily_strategy(clause_op,
+ index->opfamily[indexcol]);
+ Assert(op_strategy != 0); /* not a member of opfamily?? */
+ if (op_strategy == BTEqualStrategyNumber)
+ eqQualHere = true;
+ }
+ }
+ }
+ return result;
+}
+
/*
* build_index_paths
* Given an index and a set of index clauses for it, construct zero
@@ -1055,9 +1196,25 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- false);
+ false,
+ 0);
result = lappend(result, ipath);
+ if (can_skip)
+ {
+ result = add_possible_index_skip_paths(result, root, index,
+ index_clauses,
+ orderbyclauses,
+ orderbyclausecols,
+ useful_pathkeys,
+ index_is_ordered ?
+ ForwardScanDirection :
+ NoMovementScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count);
+ }
+
/* Consider index skip scan as well */
if (root->query_uniquekeys != NULL && can_skip)
{
@@ -1104,7 +1261,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- true);
+ true,
+ 0);
/*
* if, after costing the path, we find that it's not worth using
@@ -1137,9 +1295,23 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- false);
+ false,
+ 0);
result = lappend(result, ipath);
+ if (can_skip)
+ {
+ result = add_possible_index_skip_paths(result, root, index,
+ index_clauses,
+ NIL,
+ NIL,
+ useful_pathkeys,
+ BackwardScanDirection,
+ index_only_scan,
+ outer_relids,
+ loop_count);
+ }
+
/* Consider index skip scan as well */
if (root->query_uniquekeys != NULL && can_skip)
{
@@ -1181,7 +1353,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan,
outer_relids,
loop_count,
- true);
+ true,
+ 0);
/*
* if, after costing the path, we find that it's not worth
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6a7ac49d6e..05b20b2af3 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6013,7 +6013,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
indexScanPath = create_index_path(root, indexInfo,
NIL, NIL, NIL, NIL,
ForwardScanDirection, false,
- NULL, 1.0, false);
+ NULL, 1.0, false, 0);
return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index ac321bf31d..aff0521015 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1051,7 +1051,8 @@ create_index_path(PlannerInfo *root,
bool indexonly,
Relids required_outer,
double loop_count,
- bool partial_path)
+ bool partial_path,
+ int skip_prefix)
{
IndexPath *pathnode = makeNode(IndexPath);
RelOptInfo *rel = index->rel;
@@ -1071,6 +1072,7 @@ create_index_path(PlannerInfo *root,
pathnode->indexorderbys = indexorderbys;
pathnode->indexorderbycols = indexorderbycols;
pathnode->indexscandir = indexscandir;
+ pathnode->indexskipprefix = skip_prefix;
cost_index(pathnode, root, loop_count, partial_path);
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index aff748d67b..97db74adf9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -210,7 +210,9 @@ static bool get_actual_variable_endpoint(Relation heapRel,
MemoryContext outercontext,
Datum *endpointDatum);
static RelOptInfo *find_join_input_rel(PlannerInfo *root, Relids relids);
-
+static double estimate_num_groups_internal(PlannerInfo *root, List *groupExprs,
+ double input_rows, double rel_input_rows,
+ List **pgset, EstimationInfo *estinfo);
/*
* eqsel - Selectivity of "=" for any data types.
@@ -3367,6 +3369,19 @@ add_unique_group_var(PlannerInfo *root, List *varinfos,
double
estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
List **pgset, EstimationInfo *estinfo)
+{
+ return estimate_num_groups_internal(root, groupExprs, input_rows, -1, pgset, estinfo);
+}
+
+/*
+ * Same as estimate_num_groups, but with an extra argument to control
+ * the estimation used for the input rows of the relation. If
+ * rel_input_rows < 0, it uses the original planner estimation for the
+ * individual rels, otherwise it uses the estimation as provided to the function.
+ */
+static double
+estimate_num_groups_internal(PlannerInfo *root, List *groupExprs, double input_rows, double rel_input_rows,
+ List **pgset, EstimationInfo *estinfo)
{
List *varinfos = NIL;
double srf_multiplier = 1.0;
@@ -3533,6 +3548,12 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
int relvarcount = 0;
List *newvarinfos = NIL;
List *relvarinfos = NIL;
+ double this_rel_input_rows;
+
+ if (rel_input_rows < 0.0)
+ this_rel_input_rows = rel->rows;
+ else
+ this_rel_input_rows = rel_input_rows;
/*
* Split the list of varinfos in two - one for the current rel, one
@@ -3638,7 +3659,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
* guarding against division by zero when reldistinct is zero.
* Also skip this if we know that we are returning all rows.
*/
- if (reldistinct > 0 && rel->rows < rel->tuples)
+ if (reldistinct > 0 && this_rel_input_rows < rel->tuples)
{
/*
* Given a table containing N rows with n distinct values in a
@@ -3675,7 +3696,7 @@ estimate_num_groups(PlannerInfo *root, List *groupExprs, double input_rows,
* works well even when n is small.
*/
reldistinct *=
- (1 - pow((rel->tuples - rel->rows) / rel->tuples,
+ (1 - pow((rel->tuples - this_rel_input_rows) / rel->tuples,
rel->tuples / reldistinct));
}
reldistinct = clamp_row_est(reldistinct);
@@ -6621,8 +6642,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
double numIndexTuples;
Cost descentCost;
List *indexBoundQuals;
+ List *prefixBoundQuals;
int indexcol;
bool eqQualHere;
+ bool stillEq;
bool found_saop;
bool found_is_null_op;
double num_sa_scans;
@@ -6646,9 +6669,11 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* considered to act the same as it normally does.
*/
indexBoundQuals = NIL;
+ prefixBoundQuals = NIL;
indexcol = 0;
eqQualHere = false;
found_saop = false;
+ stillEq = true;
found_is_null_op = false;
num_sa_scans = 1;
foreach(lc, path->indexclauses)
@@ -6660,11 +6685,18 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
{
/* Beginning of a new column's quals */
if (!eqQualHere)
- break; /* done if no '=' qual for indexcol */
+ {
+ stillEq = false;
+ /* done if no '=' qual for indexcol and we're past the skip prefix */
+ if (path->indexskipprefix <= indexcol)
+ break;
+ }
eqQualHere = false;
indexcol++;
+ while (indexcol != iclause->indexcol && path->indexskipprefix > indexcol)
+ indexcol++;
if (indexcol != iclause->indexcol)
- break; /* no quals at all for indexcol */
+ break; /* no quals at all for indexcol */
}
/* Examine each indexqual associated with this index clause */
@@ -6696,7 +6728,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
clause_op = saop->opno;
found_saop = true;
/* count number of SA scans induced by indexBoundQuals only */
- if (alength > 1)
+ if (alength > 1 && stillEq)
num_sa_scans *= alength;
}
else if (IsA(clause, NullTest))
@@ -6724,7 +6756,14 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
eqQualHere = true;
}
- indexBoundQuals = lappend(indexBoundQuals, rinfo);
+ /* we keep two lists here, one with all quals up until the prefix
+ * and one with only the quals until the first inequality.
+ * we need the list with prefixes later
+ */
+ if (stillEq)
+ indexBoundQuals = lappend(indexBoundQuals, rinfo);
+ if (path->indexskipprefix > 0)
+ prefixBoundQuals = lappend(prefixBoundQuals, rinfo);
}
}
@@ -6750,7 +6789,10 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
* index-bound quals to produce a more accurate idea of the number of
* rows covered by the bound conditions.
*/
- selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+ if (path->indexskipprefix > 0)
+ selectivityQuals = add_predicate_to_index_quals(index, prefixBoundQuals);
+ else
+ selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
index->rel->relid,
@@ -6760,7 +6802,7 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
/*
* As in genericcostestimate(), we have to adjust for any
- * ScalarArrayOpExpr quals included in indexBoundQuals, and then round
+ * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
* to integer.
*/
numIndexTuples = rint(numIndexTuples / num_sa_scans);
@@ -6806,6 +6848,99 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
costs.indexStartupCost += descentCost;
costs.indexTotalCost += costs.num_sa_scans * descentCost;
+ /*
+ * Add extra costs for using an index skip scan.
+ * The index skip scan could have significantly lower cost until now,
+ * due to the different row estimation used (all the quals up to prefix,
+ * rather than all the quals up to the first non-equality operator).
+ * However, there are extra costs incurred for
+ * a) setting up the scan
+ * b) doing additional scans from root
+ * c) small extra cost per tuple comparison
+ * We add those here
+ */
+ if (path->indexskipprefix > 0)
+ {
+ List *exprlist = NULL;
+ double numgroups_estimate;
+ int i = 0;
+ ListCell *indexpr_item = list_head(path->indexinfo->indexprs);
+ List *selectivityQuals;
+ Selectivity btreeSelectivity;
+ double estimatedIndexTuplesNoPrefix;
+
+ /* some rather arbitrary extra cost for preprocessing structures needed for skip scan */
+ costs.indexStartupCost += 200.0 * cpu_operator_cost;
+ costs.indexTotalCost += 200.0 * cpu_operator_cost;
+
+ /*
+ * In order to reliably get a cost estimation for the number of scans we have to do from root,
+ * we need some estimation on the number of distinct prefixes that exist. Therefore, we need
+ * a different selectivity approximation (this time we do need to use the clauses until the first
+ * non-equality operator). Using that, we can estimate the number of groups
+ */
+ for (i = 0; i < path->indexinfo->nkeycolumns && i < path->indexskipprefix; i++)
+ {
+ Expr *expr = NULL;
+ int attr = path->indexinfo->indexkeys[i];
+ if(attr > 0)
+ {
+ TargetEntry *tentry = get_tle_by_resno(path->indexinfo->indextlist, i + 1);
+ Assert(tentry != NULL);
+ expr = tentry->expr;
+ }
+ else if (attr == 0)
+ {
+ /* Expression index */
+ expr = lfirst(indexpr_item);
+ indexpr_item = lnext(path->indexinfo->indexprs, indexpr_item);
+ }
+ else /* attr < 0 */
+ {
+ /* Index on system column is not supported */
+ Assert(false);
+ }
+
+ exprlist = lappend(exprlist, expr);
+ }
+
+ selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
+
+ btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
+ index->rel->relid,
+ JOIN_INNER,
+ NULL);
+ estimatedIndexTuplesNoPrefix = btreeSelectivity * index->rel->tuples;
+
+ /*
+ * As in genericcostestimate(), we have to adjust for any
+ * ScalarArrayOpExpr quals included in prefixBoundQuals, and then round
+ * to integer.
+ */
+ estimatedIndexTuplesNoPrefix = rint(estimatedIndexTuplesNoPrefix / num_sa_scans);
+
+ numgroups_estimate = estimate_num_groups_internal(
+ root, exprlist, estimatedIndexTuplesNoPrefix,
+ estimatedIndexTuplesNoPrefix, NULL, NULL);
+
+ /*
+ * For each distinct prefix value we add descending cost as.
+ * This is similar to the startup cost calculation for regular scans.
+ * We can do at most 2 scans from root per distinct prefix, so multiply by 2.
+ * Also add some CPU processing cost per page that we need to process, plus
+ * some additional one-time cost for scanning the leaf page. This is a more
+ * expensive estimation than the per-page cpu cost for the regular index scan.
+ * This is intentional, because the index skip scan does more processing on
+ * the leaf page.
+ */
+ if (index->tuples > 0)
+ descentCost = ceil(log(index->tuples) / log(2.0)) * cpu_operator_cost * 2;
+ else
+ descentCost = 0;
+ descentCost += (index->tree_height + 1) * 50.0 * cpu_operator_cost * 2 + 200 * cpu_operator_cost;
+ costs.indexTotalCost += costs.num_sa_scans * descentCost * numgroups_estimate;
+ }
+
/*
* If we can get an estimate of the first column's ordering correlation C
* from pg_statistic, estimate the index correlation as C for a
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 227cda4bd7..e63d0abd96 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,7 +49,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
bool indexonly,
Relids required_outer,
double loop_count,
- bool partial_path);
+ bool partial_path,
+ int skip_prefix);
extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
RelOptInfo *rel,
Path *bitmapqual,
--
2.33.1
On Thu, Jan 13, 2022 at 03:27:08PM +0000, Floris Van Nee wrote:
Could you send a rebased version? In the meantime I will change the status
on the cf app to Waiting on Author.

Attached a rebased version.
FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:
* the original one lives in [1]/messages/by-id/20200609102247.jdlatmfyeecg52fi@localhost, waiting for reviews since last May
* an alternative one posted here from Floris
[1]: /messages/by-id/20200609102247.jdlatmfyeecg52fi@localhost
Hi,
On Fri, Jan 14, 2022 at 08:55:26AM +0100, Dmitry Dolgov wrote:
FYI, I've attached this thread to the CF item as an informational one,
but as there are some patches posted here, folks may get confused. For
those who have landed here with no context, I feel obliged to mention
that now there are two alternative patch series posted under the same
CF item:
* the original one lives in [1], waiting for reviews since last May
* an alternative one posted here from Floris
Ah, I indeed wasn't sure of which patchset(s) should actually be reviewed.
It's nice to have the alternative approach threads linked in the commit fest,
but it seems that the cfbot will use the most recent attachments as the only
patchset, thus leaving the "original" one untested.
I'm not sure what the best approach is in such a situation. Maybe create a
different CF entry for each alternative, and link the other cf entry on the cf
app using the "Add annotations" or "Links" feature rather than attaching
threads?
On Fri, Jan 14, 2022 at 04:03:41PM +0800, Julien Rouhaud wrote:
I'm not sure what the best approach is in such a situation. Maybe create a
different CF entry for each alternative, and link the other cf entry on the cf
app using the "Add annotations" or "Links" feature rather than attaching
threads?
I don't mind having all of the alternatives under the same CF item; only
one being tested seems to be just a small limitation of cfbot.
Hi,
On 2022-01-22 22:37:19 +0100, Dmitry Dolgov wrote:
I don't mind having all of the alternatives under the same CF item; only
one being tested seems to be just a small limitation of cfbot.
IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feedback or whatnot.
Either way, currently this patch fails on cfbot due to a new GUC:
https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs
https://cirrus-ci.com/task/5134905372835840
Greetings,
Andres Freund
On Mon, Mar 21, 2022 at 06:34:09PM -0700, Andres Freund wrote:
IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feedback or whatnot.
On the other hand, even for patches with dependencies (i.e. patch A
depends on patch B) different CF items cause a lot of confusion for
reviewers. I guess for various flavours of the same patch it would be
even worse. But I don't have a strong opinion here.
Either way, currently this patch fails on cfbot due to a new GUC:
https://api.cirrus-ci.com/v1/artifact/task/5134905372835840/log/src/test/recovery/tmp_check/regression.diffs
https://cirrus-ci.com/task/5134905372835840
This seems to be easy to solve.
Attachments:
v41-0001-Unique-expressions.patch
From 5bae9fdf8b74e5996b606e78f8b2a5fb327e011b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Mon, 17 May 2021 11:47:07 +0200
Subject: [PATCH v41 1/6] Unique expressions
Extend PlannerInfo and Path structures with the list of relevant unique
expressions. It specifies which keys must be unique on the query
level, and allows this to be leveraged at the path level. At the moment
only distinctClause makes use of this mechanism, which enables potential
use of index skip scan.
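
As a rough illustration of the intent (hypothetical query, not taken from
the patch): for

SELECT DISTINCT a, b FROM t ORDER BY a, b;

the distinctClause yields the query-level unique expressions {a, b}; a path
whose own uniquekeys cover that set can then be added to the distinct rel
directly from the separate unique_pathlist, with no Unique node on top.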
Originally proposed by David Rowley, based on the UniqueKey patch
implementation from Andy Fan, and contains a few bits from the previous
version by Jesper Pedersen and Floris Van Nee.
---
src/backend/nodes/list.c | 31 +++++++++
src/backend/optimizer/path/Makefile | 3 +-
src/backend/optimizer/path/pathkeys.c | 62 +++++++++++++++++
src/backend/optimizer/path/uniquekeys.c | 92 +++++++++++++++++++++++++
src/backend/optimizer/plan/planner.c | 36 +++++++++-
src/backend/optimizer/util/pathnode.c | 32 ++++++---
src/include/nodes/pathnodes.h | 5 ++
src/include/nodes/pg_list.h | 1 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/paths.h | 9 +++
10 files changed, 261 insertions(+), 11 deletions(-)
create mode 100644 src/backend/optimizer/path/uniquekeys.c
diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index f843f861ef..a53a50f372 100644
--- a/src/backend/nodes/list.c
+++ b/src/backend/nodes/list.c
@@ -1653,3 +1653,34 @@ list_oid_cmp(const ListCell *p1, const ListCell *p2)
return 1;
return 0;
}
+
+/*
+ * Return true iff every entry in "members" list is also present
+ * in the "target" list.
+ */
+bool
+list_is_subset(const List *members, const List *target)
+{
+ const ListCell *lc1, *lc2;
+
+ Assert(IsPointerList(members));
+ Assert(IsPointerList(target));
+ check_list_invariants(members);
+ check_list_invariants(target);
+
+ foreach(lc1, members)
+ {
+ bool found = false;
+ foreach(lc2, target)
+ {
+ if (equal(lfirst(lc1), lfirst(lc2)))
+ {
+ found = true;
+ break;
+ }
+ }
+ if (!found)
+ return false;
+ }
+ return true;
+}
diff --git a/src/backend/optimizer/path/Makefile b/src/backend/optimizer/path/Makefile
index 1e199ff66f..7b9820c25f 100644
--- a/src/backend/optimizer/path/Makefile
+++ b/src/backend/optimizer/path/Makefile
@@ -21,6 +21,7 @@ OBJS = \
joinpath.o \
joinrels.o \
pathkeys.o \
- tidpath.o
+ tidpath.o \
+ uniquekeys.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index 86a35cdef1..e2be1fbf90 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -29,6 +29,7 @@
#include "utils/lsyscache.h"
+static bool pathkey_is_unique(PathKey *new_pathkey, List *pathkeys);
static bool pathkey_is_redundant(PathKey *new_pathkey, List *pathkeys);
static bool matches_boolean_partition_clause(RestrictInfo *rinfo,
RelOptInfo *partrel,
@@ -96,6 +97,29 @@ make_canonical_pathkey(PlannerInfo *root,
return pk;
}
+/*
+ * pathkey_is_unique
+ * Checks if the new pathkey's equivalence class is the same as that of
+ * any existing member of the pathkey list.
+ */
+static bool
+pathkey_is_unique(PathKey *new_pathkey, List *pathkeys)
+{
+ EquivalenceClass *new_ec = new_pathkey->pk_eclass;
+ ListCell *lc;
+
+ /* If the same EC is already in the list, then not unique */
+ foreach(lc, pathkeys)
+ {
+ PathKey *old_pathkey = (PathKey *) lfirst(lc);
+
+ if (new_ec == old_pathkey->pk_eclass)
+ return false;
+ }
+
+ return true;
+}
+
/*
* pathkey_is_redundant
* Is a pathkey redundant with one already in the given list?
@@ -1152,6 +1176,44 @@ make_pathkeys_for_sortclauses(PlannerInfo *root,
return pathkeys;
}
+/*
+ * make_pathkeys_for_uniquekeys
+ * Generate a pathkeys list to be used for uniquekey clauses
+ */
+List *
+make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist)
+{
+ List *pathkeys = NIL;
+ ListCell *l;
+
+ foreach(l, sortclauses)
+ {
+ SortGroupClause *sortcl = (SortGroupClause *) lfirst(l);
+ Expr *sortkey;
+ PathKey *pathkey;
+
+ sortkey = (Expr *) get_sortgroupclause_expr(sortcl, tlist);
+ Assert(OidIsValid(sortcl->sortop));
+ pathkey = make_pathkey_from_sortop(root,
+ sortkey,
+ root->nullable_baserels,
+ sortcl->sortop,
+ sortcl->nulls_first,
+ sortcl->tleSortGroupRef,
+ true);
+
+ if (EC_MUST_BE_REDUNDANT(pathkey->pk_eclass))
+ continue;
+
+ if (pathkey_is_unique(pathkey, pathkeys))
+ pathkeys = lappend(pathkeys, pathkey);
+ }
+
+ return pathkeys;
+}
+
/****************************************************************************
* PATHKEYS AND MERGECLAUSES
****************************************************************************/
diff --git a/src/backend/optimizer/path/uniquekeys.c b/src/backend/optimizer/path/uniquekeys.c
new file mode 100644
index 0000000000..d2525771e3
--- /dev/null
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -0,0 +1,92 @@
+/*-------------------------------------------------------------------------
+ *
+ * uniquekeys.c
+ * Utilities for matching and building unique keys
+ *
+ * Portions Copyright (c) 2020, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/optimizer/path/uniquekeys.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "optimizer/paths.h"
+
+/*
+ * build_uniquekeys
+ * Prepare a list of pathkeys which are considered to be unique for
+ * this query.
+ *
+ * For now used only for distinct clauses, where redundant keys need to be
+ * preserved e.g. for skip scan. The justification for this function's existence
+ * is the future plan to make it produce an actual UniqueKey list.
+ */
+List*
+build_uniquekeys(PlannerInfo *root, List *sortclauses)
+{
+ List *result = NIL;
+ List *sortkeys;
+ ListCell *l;
+ List *exprs = NIL;
+
+ sortkeys = make_pathkeys_for_uniquekeys(root,
+ sortclauses,
+ root->processed_tlist);
+
+ /* Create a uniquekey and add it to the list */
+ foreach(l, sortkeys)
+ {
+ PathKey *pathkey = (PathKey *) lfirst(l);
+ EquivalenceClass *ec = pathkey->pk_eclass;
+ EquivalenceMember *mem = (EquivalenceMember*) lfirst(list_head(ec->ec_members));
+ exprs = lappend(exprs, mem->em_expr);
+ }
+
+ result = lappend(result, exprs);
+
+ return result;
+}
+
+/*
+ * query_has_uniquekeys_for
+ * Check if the specified unique keys match all query level unique
+ * keys.
+ *
+ * The main use is to verify that unique keys for some path are covering all
+ * requested query unique keys. Based on this information a path could be
+ * rejected if it satisfies uniqueness only partially.
+ */
+bool
+query_has_uniquekeys_for(PlannerInfo *root, List *path_uniquekeys,
+ bool allow_multinulls)
+{
+ ListCell *lc;
+ ListCell *lc2;
+
+ /* root->query_uniquekeys are the requested DISTINCT clauses on query level
+ * path_uniquekeys are the unique keys on current path. All requested
+ * query_uniquekeys must be satisfied by the path_uniquekeys.
+ */
+ foreach(lc, root->query_uniquekeys)
+ {
+ List *query_ukey = lfirst_node(List, lc);
+ bool satisfied = false;
+ foreach(lc2, path_uniquekeys)
+ {
+ List *ukey = lfirst_node(List, lc2);
+ if (list_length(ukey) == 0 &&
+ list_length(query_ukey) != 0)
+ continue;
+ if (list_is_subset(ukey, query_ukey))
+ {
+ satisfied = true;
+ break;
+ }
+ }
+ if (!satisfied)
+ return false;
+ }
+ return true;
+}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bd09f85aea..b67bff8ccc 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -3102,12 +3102,18 @@ standard_qp_callback(PlannerInfo *root, void *extra)
if (parse->distinctClause &&
grouping_is_sortable(parse->distinctClause))
+ {
root->distinct_pathkeys =
make_pathkeys_for_sortclauses(root,
parse->distinctClause,
tlist);
+ root->query_uniquekeys = build_uniquekeys(root, parse->distinctClause);
+ }
else
+ {
root->distinct_pathkeys = NIL;
+ root->query_uniquekeys = NIL;
+ }
root->sort_pathkeys =
make_pathkeys_for_sortclauses(root,
@@ -4493,13 +4499,19 @@ create_final_distinct_paths(PlannerInfo *root, RelOptInfo *input_rel,
Path *path = (Path *) lfirst(lc);
if (pathkeys_contained_in(needed_pathkeys, path->pathkeys))
- {
add_path(distinct_rel, (Path *)
create_upper_unique_path(root, distinct_rel,
path,
list_length(root->distinct_pathkeys),
numDistinctRows));
- }
+ }
+
+ foreach(lc, input_rel->unique_pathlist)
+ {
+ Path *path = (Path *) lfirst(lc);
+
+ if (query_has_uniquekeys_for(root, path->uniquekeys, false))
+ add_path(distinct_rel, path);
}
/* For explicit-sort case, always use the more rigorous clause */
@@ -7109,6 +7121,26 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
}
}
+ foreach(lc, rel->unique_pathlist)
+ {
+ Path *subpath = (Path *) lfirst(lc);
+
+ /* Shouldn't have any parameterized paths anymore */
+ Assert(subpath->param_info == NULL);
+
+ if (tlist_same_exprs)
+ subpath->pathtarget->sortgrouprefs =
+ scanjoin_target->sortgrouprefs;
+ else
+ {
+ Path *newpath;
+
+ newpath = (Path *) create_projection_path(root, rel, subpath,
+ scanjoin_target);
+ lfirst(lc) = newpath;
+ }
+ }
+
/*
* Now, if final scan/join target contains SRFs, insert ProjectSetPath(s)
* atop each existing path. (Note that this function doesn't look at the
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 5c32c96b71..abb77d867e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -416,10 +416,10 @@ set_cheapest(RelOptInfo *parent_rel)
* 'parent_rel' is the relation entry to which the path corresponds.
* 'new_path' is a potential path for parent_rel.
*
- * Returns nothing, but modifies parent_rel->pathlist.
+ * Returns modified pathlist.
*/
-void
-add_path(RelOptInfo *parent_rel, Path *new_path)
+static List *
+add_path_to(RelOptInfo *parent_rel, List *pathlist, Path *new_path)
{
bool accept_new = true; /* unless we find a superior old path */
int insert_at = 0; /* where to insert new item */
@@ -440,7 +440,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
* for more than one old path to be tossed out because new_path dominates
* it.
*/
- foreach(p1, parent_rel->pathlist)
+ foreach(p1, pathlist)
{
Path *old_path = (Path *) lfirst(p1);
bool remove_old = false; /* unless new proves superior */
@@ -584,8 +584,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
*/
if (remove_old)
{
- parent_rel->pathlist = foreach_delete_current(parent_rel->pathlist,
- p1);
+ pathlist = foreach_delete_current(pathlist, p1);
/*
* Delete the data pointed-to by the deleted cell, if possible
@@ -612,8 +611,7 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
if (accept_new)
{
/* Accept the new path: insert it at proper place in pathlist */
- parent_rel->pathlist =
- list_insert_nth(parent_rel->pathlist, insert_at, new_path);
+ pathlist = list_insert_nth(pathlist, insert_at, new_path);
}
else
{
@@ -621,6 +619,23 @@ add_path(RelOptInfo *parent_rel, Path *new_path)
if (!IsA(new_path, IndexPath))
pfree(new_path);
}
+
+ return pathlist;
+}
+
+void
+add_path(RelOptInfo *parent_rel, Path *new_path)
+{
+ parent_rel->pathlist = add_path_to(parent_rel,
+ parent_rel->pathlist, new_path);
+}
+
+void
+add_unique_path(RelOptInfo *parent_rel, Path *new_path)
+{
+ parent_rel->unique_pathlist = add_path_to(parent_rel,
+ parent_rel->unique_pathlist,
+ new_path);
}
/*
@@ -2662,6 +2677,7 @@ create_projection_path(PlannerInfo *root,
pathnode->path.pathkeys = subpath->pathkeys;
pathnode->subpath = subpath;
+ pathnode->path.uniquekeys = subpath->uniquekeys;
/*
* We might not need a separate Result node. If the input plan node type
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 1f3845b3fe..056b13826a 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -293,6 +293,7 @@ struct PlannerInfo
List *query_pathkeys; /* desired pathkeys for query_planner() */
+ List *query_uniquekeys; /* unique keys required for the query */
List *group_pathkeys; /* groupClause pathkeys, if any */
List *window_pathkeys; /* pathkeys of bottom window, if any */
List *distinct_pathkeys; /* distinctClause pathkeys, if any */
@@ -695,6 +696,7 @@ typedef struct RelOptInfo
List *pathlist; /* Path structures */
List *ppilist; /* ParamPathInfos used in pathlist */
List *partial_pathlist; /* partial Paths */
+ List *unique_pathlist; /* unique Paths */
struct Path *cheapest_startup_path;
struct Path *cheapest_total_path;
struct Path *cheapest_unique_path;
@@ -883,6 +885,7 @@ struct IndexOptInfo
bool amsearchnulls; /* can AM search for NULL/NOT NULL entries? */
bool amhasgettuple; /* does AM have amgettuple interface? */
bool amhasgetbitmap; /* does AM have amgetbitmap interface? */
+ bool amcanskip; /* can AM skip duplicate values? */
bool amcanparallel; /* does AM support parallel scan? */
bool amcanmarkpos; /* does AM support mark/restore? */
/* Rather than include amapi.h here, we declare amcostestimate like this */
@@ -1196,6 +1199,8 @@ typedef struct Path
List *pathkeys; /* sort ordering of path's output */
/* pathkeys is a List of PathKey nodes; see above */
+
+ List *uniquekeys; /* the unique keys, or NIL if none */
} Path;
/* Macro for extracting a path's parameterization relids; beware double eval */
diff --git a/src/include/nodes/pg_list.h b/src/include/nodes/pg_list.h
index 2cb9d1371d..4ac871fd16 100644
--- a/src/include/nodes/pg_list.h
+++ b/src/include/nodes/pg_list.h
@@ -567,6 +567,7 @@ extern pg_nodiscard List *list_delete_last(List *list);
extern pg_nodiscard List *list_delete_first_n(List *list, int n);
extern pg_nodiscard List *list_delete_nth_cell(List *list, int n);
extern pg_nodiscard List *list_delete_cell(List *list, ListCell *cell);
+extern bool list_is_subset(const List *members, const List *target);
extern List *list_union(const List *list1, const List *list2);
extern List *list_union_ptr(const List *list1, const List *list2);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 620eeda2d6..bb6d730e93 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -27,6 +27,7 @@ extern int compare_fractional_path_costs(Path *path1, Path *path2,
double fraction);
extern void set_cheapest(RelOptInfo *parent_rel);
extern void add_path(RelOptInfo *parent_rel, Path *new_path);
+extern void add_unique_path(RelOptInfo *parent_rel, Path *new_path);
extern bool add_path_precheck(RelOptInfo *parent_rel,
Cost startup_cost, Cost total_cost,
List *pathkeys, Relids required_outer);
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 0c3a0b90c8..3dfa21adad 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -229,6 +229,9 @@ extern List *build_join_pathkeys(PlannerInfo *root,
extern List *make_pathkeys_for_sortclauses(PlannerInfo *root,
List *sortclauses,
List *tlist);
+extern List *make_pathkeys_for_uniquekeys(PlannerInfo *root,
+ List *sortclauses,
+ List *tlist);
extern void initialize_mergeclause_eclasses(PlannerInfo *root,
RestrictInfo *restrictinfo);
extern void update_mergeclause_eclasses(PlannerInfo *root,
@@ -255,4 +258,10 @@ extern PathKey *make_canonical_pathkey(PlannerInfo *root,
extern void add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *live_childrels);
+extern bool query_has_uniquekeys_for(PlannerInfo *root,
+ List *exprs,
+ bool allow_multinulls);
+
+extern List *build_uniquekeys(PlannerInfo *root, List *sortclauses);
+
#endif /* PATHS_H */
--
2.32.0
v41-0002-Index-skip-scan.patch
From 1f61de293ad1eef7e91971c4c26aab031ae205c0 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 8 Jan 2022 17:16:49 +0100
Subject: [PATCH v41 2/6] Index skip scan
Allow IndexOnlyScan to skip duplicated tuples based on search key prefix
(a trick also known as Index Skip Scan or Loose Index Scan, see the
wiki [1]). The idea is to avoid scanning all equal values of a key: as
soon as a new value is found, restart the search by looking for a larger
value. This approach is much faster when the index has many equal keys.
Implemented via equipping IndexPath with an indexskipprefix field and
creating an extra IndexPath with such a prefix if suitable unique
expressions are present. On the execution side a new index am function
amskip is introduced to provide an index-specific implementation for such
skipping. To simplify potential amskip implementations,
ExecSupportsBackwardScan now returns false in case index skip scan is
used, since otherwise amskip would have to deal with scroll cursors and
be prepared to handle different advance/read directions.
ExecSupportsBackwardScan may seem to have too big a scope, but it looks
like it is now used only together with cursorOptions checks for
CURSOR_OPT_SCROLL.
Original patch and design were proposed by Thomas Munro [2], revived and
improved by Dmitry Dolgov and Jesper Pedersen.
[1] https://wiki.postgresql.org/wiki/Loose_indexscan
[2] https://www.postgresql.org/message-id/flat/CADLWmXXbTSBxP-MzJuPAYSsL_2f0iPm5VWPbCvDbVvfX93FKkw%40mail.gmail.com
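
For a sense of the workload this targets, a minimal sketch (hypothetical
table and data, not part of the patch):

CREATE TABLE t (a int, b int);
CREATE INDEX ON t (a, b);
INSERT INTO t SELECT i % 5, i FROM generate_series(1, 1000000) i;
ANALYZE t;

-- with only five distinct values of a, a skipping index-only scan should
-- touch a handful of leaf pages instead of stepping through every entry;
-- the new enable_indexskipscan GUC (default on) controls whether the
-- planner considers it
EXPLAIN (COSTS OFF) SELECT DISTINCT a FROM t;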
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
contrib/bloom/blutils.c | 1 +
doc/src/sgml/config.sgml | 15 ++
doc/src/sgml/indexam.sgml | 43 ++++++
doc/src/sgml/indices.sgml | 23 +++
src/backend/access/brin/brin.c | 1 +
src/backend/access/gin/ginutil.c | 1 +
src/backend/access/gist/gist.c | 1 +
src/backend/access/hash/hash.c | 1 +
src/backend/access/index/indexam.c | 16 ++
src/backend/access/spgist/spgutils.c | 1 +
src/backend/commands/explain.c | 23 +++
src/backend/executor/execAmi.c | 32 +++-
src/backend/executor/nodeIndexonlyscan.c | 47 +++++-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/costsize.c | 1 +
src/backend/optimizer/path/indxpath.c | 140 +++++++++++++++++-
src/backend/optimizer/path/pathkeys.c | 54 ++++++-
src/backend/optimizer/plan/createplan.c | 10 +-
src/backend/optimizer/util/pathnode.c | 37 +++++
src/backend/optimizer/util/plancat.c | 1 +
src/backend/utils/misc/guc.c | 10 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/amapi.h | 6 +
src/include/access/genam.h | 1 +
src/include/access/sdir.h | 7 +
src/include/nodes/execnodes.h | 2 +
src/include/nodes/pathnodes.h | 4 +
src/include/nodes/plannodes.h | 2 +
src/include/optimizer/cost.h | 1 +
src/include/optimizer/pathnode.h | 4 +
src/include/optimizer/paths.h | 5 +-
src/test/regress/expected/sysviews.out | 3 +-
34 files changed, 478 insertions(+), 19 deletions(-)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a434cf93ef..3b312c039d 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -134,6 +134,7 @@ blhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = blbulkdelete;
amroutine->amvacuumcleanup = blvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = blcostestimate;
amroutine->amoptions = bloptions;
amroutine->amproperty = NULL;
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7a48973b3c..e43295861f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -5035,6 +5035,21 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
</listitem>
</varlistentry>
+ <varlistentry id="guc-enable-indexskipscan" xreflabel="enable_indexskipscan">
+ <term><varname>enable_indexskipscan</varname> (<type>boolean</type>)
+ <indexterm>
+ <primary><varname>enable_indexskipscan</varname> configuration parameter</primary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Enables or disables the query planner's use of index-skip-scan plan
+ types (see <xref linkend="indexes-index-skip-scans"/>). The default is
+ <literal>on</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="guc-enable-material" xreflabel="enable_material">
<term><varname>enable_material</varname> (<type>boolean</type>)
<indexterm>
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index d4163c96e9..31081d0f8d 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -153,6 +153,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
@@ -779,6 +780,48 @@ amrestrpos (IndexScanDesc scan);
<para>
<programlisting>
+bool
+amskip (IndexScanDesc scan,
+ ScanDirection direction,
+ int prefix);
+</programlisting>
+ Skip past all tuples where the first 'prefix' columns have the same value as
+ the last tuple returned in the current scan. The arguments are:
+
+ <variablelist>
+ <varlistentry>
+ <term><parameter>scan</parameter></term>
+ <listitem>
+ <para>
+ Index scan information
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>direction</parameter></term>
+ <listitem>
+ <para>
+ The direction in which data is advancing.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><parameter>prefix</parameter></term>
+ <listitem>
+ <para>
+ Distinct prefix size.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ </para>
+
+ <para>
+<programlisting>
Size
amestimateparallelscan (void);
</programlisting>
diff --git a/doc/src/sgml/indices.sgml b/doc/src/sgml/indices.sgml
index 023157d888..ab9595d37f 100644
--- a/doc/src/sgml/indices.sgml
+++ b/doc/src/sgml/indices.sgml
@@ -1297,6 +1297,29 @@ SELECT target FROM tests WHERE subject = 'some-subject' AND success;
and later will recognize such cases and allow index-only scans to be
generated, but older versions will not.
</para>
+
+ <sect2 id="indexes-index-skip-scans">
+ <title>Index Skip Scans</title>
+
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index</primary>
+ <secondary>index-skip scans</secondary>
+ </indexterm>
+ <indexterm zone="indexes-index-skip-scans">
+ <primary>index-skip scan</primary>
+ </indexterm>
+
+ <para>
+ When the rows retrieved from an index scan are then deduplicated by
+ eliminating rows matching on a prefix of index keys (e.g. when using
+ <literal>SELECT DISTINCT</literal>), the planner will consider
+ skipping groups of rows with a matching key prefix. When a row with
+ a particular prefix is found, remaining rows with the same key prefix
+ are skipped. The larger the number of rows with the same key prefix
+ rows (i.e. the lower the number of distinct key prefixes in the index),
+ the more efficient this is.
+ </para>
+ </sect2>
</sect1>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4366010768..0d04b299d3 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -119,6 +119,7 @@ brinhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = brinbulkdelete;
amroutine->amvacuumcleanup = brinvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = brincostestimate;
amroutine->amoptions = brinoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3d15701a01..56292eb822 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -67,6 +67,7 @@ ginhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = ginbulkdelete;
amroutine->amvacuumcleanup = ginvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gincostestimate;
amroutine->amoptions = ginoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8c6c744ab7..16a45f05bc 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -89,6 +89,7 @@ gisthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = gistbulkdelete;
amroutine->amvacuumcleanup = gistvacuumcleanup;
amroutine->amcanreturn = gistcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = gistcostestimate;
amroutine->amoptions = gistoptions;
amroutine->amproperty = gistproperty;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a259a301fa..41e5e9b594 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,6 +86,7 @@ hashhandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = hashbulkdelete;
amroutine->amvacuumcleanup = hashvacuumcleanup;
amroutine->amcanreturn = NULL;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = hashcostestimate;
amroutine->amoptions = hashoptions;
amroutine->amproperty = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index fe80b8b0ba..bcf7c73467 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
* index_can_return - does index support index-only scans?
* index_getprocid - get a support procedure OID
* index_getprocinfo - get a support procedure's lookup info
+ * index_skip - advance past duplicate key values in a scan
*
* NOTES
* This file contains the index_ routines which used
@@ -739,6 +740,21 @@ index_can_return(Relation indexRelation, int attno)
return indexRelation->rd_indam->amcanreturn(indexRelation, attno);
}
+/* ----------------
+ * index_skip
+ *
+ * Skip past all tuples where the first 'prefix' columns have the
+ * same value as the last tuple returned in the current scan.
+ * ----------------
+ */
+bool
+index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ SCAN_CHECKS;
+
+ return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+}
+
/* ----------------
* index_getprocid
*
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 1ae7492216..0b0dfa278f 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -73,6 +73,7 @@ spghandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amvacuumcleanup = spgvacuumcleanup;
amroutine->amcanreturn = spgcanreturn;
+ amroutine->amskip = NULL;
amroutine->amcostestimate = spgcostestimate;
amroutine->amoptions = spgoptions;
amroutine->amproperty = spgproperty;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9f632285b6..4f5bd1d678 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -152,6 +152,7 @@ static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
static void ExplainIndentText(ExplainState *es);
static void ExplainJSONLineEnding(ExplainState *es);
static void ExplainYAMLLineStarting(ExplainState *es);
+static void ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es);
static void escape_yaml(StringInfo buf, const char *str);
@@ -1120,6 +1121,21 @@ ExplainPreScanNode(PlanState *planstate, Bitmapset **rels_used)
return planstate_tree_walker(planstate, ExplainPreScanNode, rels_used);
}
+/*
+ * ExplainIndexSkipScanKeys -
+ * Append information about index skip scan to es->str.
+ *
+ * Can be used to print the skip prefix size.
+ */
+static void
+ExplainIndexSkipScanKeys(int skipPrefixSize, ExplainState *es)
+{
+ if (skipPrefixSize > 0)
+ {
+ ExplainPropertyInteger("Distinct Prefix", NULL, skipPrefixSize, es);
+ }
+}
+
/*
* ExplainNode -
* Appends a description of a plan tree to es->str
@@ -1477,6 +1493,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+
ExplainIndexScanDetails(indexonlyscan->indexid,
indexonlyscan->indexorderdir,
es);
@@ -1750,6 +1767,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate, es);
break;
case T_IndexOnlyScan:
+ if (((IndexOnlyScan *) plan)->indexskipprefixsize > 0)
+ {
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) plan;
+ ExplainPropertyBool("Skip scan", true, es);
+ ExplainIndexSkipScanKeys(indexonlyscan->indexskipprefixsize, es);
+ }
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
"Index Cond", planstate, ancestors, es);
if (((IndexOnlyScan *) plan)->recheckqual)
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b6245994f0..ced8933f44 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -64,7 +64,7 @@
#include "utils/rel.h"
#include "utils/syscache.h"
-static bool IndexSupportsBackwardScan(Oid indexid);
+static bool IndexSupportsBackwardScan(Plan *node);
/*
@@ -555,10 +555,8 @@ ExecSupportsBackwardScan(Plan *node)
return false;
case T_IndexScan:
- return IndexSupportsBackwardScan(((IndexScan *) node)->indexid);
-
case T_IndexOnlyScan:
- return IndexSupportsBackwardScan(((IndexOnlyScan *) node)->indexid);
+ return IndexSupportsBackwardScan(node);
case T_SubqueryScan:
return ExecSupportsBackwardScan(((SubqueryScan *) node)->subplan);
@@ -598,16 +596,38 @@ ExecSupportsBackwardScan(Plan *node)
/*
* An IndexScan or IndexOnlyScan node supports backward scan only if the
- * index's AM does.
+ * index's AM does and no skip scan is used.
*/
static bool
-IndexSupportsBackwardScan(Oid indexid)
+IndexSupportsBackwardScan(Plan *node)
{
bool result;
+ Oid indexid = InvalidOid;
+ int skip_prefix_size = 0;
HeapTuple ht_idxrel;
Form_pg_class idxrelrec;
IndexAmRoutine *amroutine;
+ Assert(IsA(node, IndexScan) || IsA(node, IndexOnlyScan));
+ switch(nodeTag(node))
+ {
+ case T_IndexScan:
+ indexid = ((IndexScan *) node)->indexid;
+ break;
+
+ case T_IndexOnlyScan:
+ indexid = ((IndexOnlyScan *) node)->indexid;
+ skip_prefix_size = ((IndexOnlyScan *) node)->indexskipprefixsize;
+ break;
+
+ default:
+ elog(DEBUG2, "unrecognized node type: %d", (int) nodeTag(node));
+ break;
+ }
+
+ if (skip_prefix_size > 0)
+ return false;
+
/* Fetch the pg_class tuple of the index relation */
ht_idxrel = SearchSysCache1(RELOID, ObjectIdGetDatum(indexid));
if (!HeapTupleIsValid(ht_idxrel))
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index eb3ddd2943..40ad1b949b 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -41,6 +41,7 @@
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/predicate.h"
+#include "storage/itemptr.h"
#include "utils/memutils.h"
#include "utils/rel.h"
@@ -62,9 +63,17 @@ IndexOnlyNext(IndexOnlyScanState *node)
EState *estate;
ExprContext *econtext;
ScanDirection direction;
+ ScanDirection readDirection;
IndexScanDesc scandesc;
TupleTableSlot *slot;
- ItemPointer tid;
+ ItemPointer tid = NULL;
+ IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping. In this case
+ * there is no need to call index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -72,7 +81,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexOnlyScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexonlyscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -115,15 +124,43 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix.
+ */
+ if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ {
+ if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset ioss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we apply skip again. It would be incorrect and lead to an
+ * extra skipped item.
+ */
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ tid = &scandesc->xs_heaptid;
+ }
+ }
+
+ readDirection = skipped ? indexonlyscan->indexorderdir : direction;
+
/*
* OK, now that we have what we need, fetch the next tuple.
*/
- while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
+ while (skipped || (tid = index_getnext_tid(scandesc, readDirection)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
+ skipped = false;
+
/*
* We can skip the heap fetch if the TID references a heap page on
* which all tuples are known visible to everybody. In any case,
@@ -248,6 +285,8 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
estate->es_snapshot);
+ node->ioss_FirstTupleEmitted = true;
+
return slot;
}
@@ -502,6 +541,8 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
+ indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->ioss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d4f8455a2b..fe0d92ad46 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -523,6 +523,7 @@ _copyIndexOnlyScan(const IndexOnlyScan *from)
COPY_NODE_FIELD(indexorderby);
COPY_NODE_FIELD(indextlist);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6bdad462c7..06eb1a89e9 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -584,6 +584,7 @@ _outIndexOnlyScan(StringInfo str, const IndexOnlyScan *node)
WRITE_NODE_FIELD(indexorderby);
WRITE_NODE_FIELD(indextlist);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 3f68f7c18d..169fb408d3 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1888,6 +1888,7 @@ _readIndexOnlyScan(void)
READ_NODE_FIELD(indexorderby);
READ_NODE_FIELD(indextlist);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 4d9f3b4bb6..6c42ae2121 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -133,6 +133,7 @@ int max_parallel_workers_per_gather = 2;
bool enable_seqscan = true;
bool enable_indexscan = true;
bool enable_indexonlyscan = true;
+bool enable_indexskipscan = true;
bool enable_bitmapscan = true;
bool enable_tidscan = true;
bool enable_sort = true;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 0ef70ad7f1..00526f3476 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -784,6 +784,16 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
{
IndexPath *ipath = (IndexPath *) lfirst(lc);
+ /*
+ * To prevent unique paths from index skip scans from being used when
+ * they are not needed, keep them in a separate pathlist.
+ */
+ if (ipath->indexskipprefix != 0)
+ {
+ add_unique_path(rel, (Path *) ipath);
+ continue;
+ }
+
if (index->amhasgettuple)
add_path(rel, (Path *) ipath);
@@ -866,12 +876,15 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
double loop_count;
List *orderbyclauses;
List *orderbyclausecols;
- List *index_pathkeys;
+ List *index_pathkeys = NIL;
List *useful_pathkeys;
+ List *index_pathkeys_pos = NIL;
bool found_lower_saop_clause;
bool pathkeys_possibly_useful;
bool index_is_ordered;
bool index_only_scan;
+ bool not_empty_qual = false;
+ bool can_skip;
int indexcol;
/*
@@ -989,7 +1002,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
if (index_is_ordered && pathkeys_possibly_useful)
{
index_pathkeys = build_index_pathkeys(root, index,
- ForwardScanDirection);
+ ForwardScanDirection,
+ &index_pathkeys_pos);
useful_pathkeys = truncate_useless_pathkeys(root, rel,
index_pathkeys);
orderbyclauses = NIL;
@@ -1021,6 +1035,72 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
index_only_scan = (scantype != ST_BITMAPSCAN &&
check_index_only(rel, index));
+ /* Check if an index skip scan is possible. */
+ can_skip = enable_indexskipscan && index->amcanskip && index_only_scan;
+
+ if (can_skip)
+ {
+ /*
+ * Skip scan is not supported when there are qual conditions that are
+ * not covered by the index, because those conditions are evaluated
+ * only after skipping has already been applied.
+ *
+ * TODO: This implementation is too restrictive, and doesn't allow e.g.
+ * index expressions. For that we need to examine index_clauses too.
+ */
+ if (root->parse->jointree != NULL)
+ {
+ ListCell *lc;
+
+ foreach(lc, (List *) root->parse->jointree->quals)
+ {
+ Node *expr, *qual = (Node *) lfirst(lc);
+ OpExpr *expr_op;
+ Var *var;
+ bool found = false;
+
+ if (!is_opclause(qual))
+ {
+ not_empty_qual = true;
+ break;
+ }
+
+ expr = get_leftop(qual);
+ expr_op = (OpExpr *) qual;
+
+ if (!IsA(expr, Var))
+ {
+ not_empty_qual = true;
+ break;
+ }
+
+ var = (Var *) expr;
+
+ /*
+ * Check if the qual operator is indexable by any columns of
+ * the index, test collation and opfamily.
+ */
+ for (int i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno &&
+ IndexCollMatchesExprColl(index->indexcollations[i],
+ expr_op->inputcollid) &&
+ op_in_opfamily(expr_op->opno, index->opfamily[i]))
+ {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found)
+ {
+ not_empty_qual = true;
+ break;
+ }
+ }
+ }
+ }
+
/*
* 4. Generate an indexscan path if there are relevant restriction clauses
* in the current clauses, OR the index ordering is potentially useful for
@@ -1044,6 +1124,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
false);
result = lappend(result, ipath);
+ /* Consider index skip scan as well */
+ if (root->query_uniquekeys != NULL && can_skip && !not_empty_qual)
+ {
+ int numusefulkeys = list_length(useful_pathkeys);
+ int numsortkeys = list_length(root->query_pathkeys);
+
+ if (numusefulkeys == numsortkeys)
+ {
+ int prefix;
+ if (list_length(root->distinct_pathkeys) > 0)
+ prefix = find_index_prefix_for_pathkey(index_pathkeys,
+ index_pathkeys_pos,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
+ else
+ /*
+ * All distinct keys are constant and optimized away.
+ * Skipping with 1 is sufficient.
+ */
+ prefix = 1;
+
+ result = lappend(result,
+ create_skipscan_unique_path(root, index,
+ (Path *) ipath, prefix));
+ }
+ }
+
/*
* If appropriate, consider parallel index scan. We don't allow
* parallel index scan for bitmap index scans.
@@ -1082,7 +1189,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
if (index_is_ordered && pathkeys_possibly_useful)
{
index_pathkeys = build_index_pathkeys(root, index,
- BackwardScanDirection);
+ BackwardScanDirection,
+ &index_pathkeys_pos);
useful_pathkeys = truncate_useless_pathkeys(root, rel,
index_pathkeys);
if (useful_pathkeys != NIL)
@@ -1099,6 +1207,32 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
false);
result = lappend(result, ipath);
+ /* Consider index skip scan as well */
+ if (root->query_uniquekeys != NULL && can_skip && !not_empty_qual)
+ {
+ int numusefulkeys = list_length(useful_pathkeys);
+ int numsortkeys = list_length(root->query_pathkeys);
+
+ if (numusefulkeys == numsortkeys)
+ {
+ int prefix;
+ if (list_length(root->distinct_pathkeys) > 0)
+ prefix = find_index_prefix_for_pathkey(index_pathkeys,
+ index_pathkeys_pos,
+ llast_node(PathKey,
+ root->distinct_pathkeys));
+ else
+ /*
+ * All distinct keys are constant and optimized away.
+ * Skipping with 1 is sufficient as all are constant anyway.
+ */
+ prefix = 1;
+
+ result = lappend(result,
+ create_skipscan_unique_path(root, index,
+ (Path *) ipath, prefix));
+ }
+ }
+
/* If appropriate, consider parallel index scan */
if (index->amcanparallel &&
rel->consider_parallel && outer_relids == NULL &&
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index e2be1fbf90..cfdff4eee9 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -523,6 +523,47 @@ get_cheapest_parallel_safe_total_inner(List *paths)
* NEW PATHKEY FORMATION
****************************************************************************/
+/*
+ * Find the prefix size for a specific path key in an index. For example, for
+ * an index on (a,b,c), finding path key b will return prefix 2. Optionally
+ * pathkeys_positions can be provided, to specify at which position in the
+ * original pathkey list this particular key could be found (this is helpful
+ * when we deal with redundant pathkeys).
+ *
+ * Returns 0 when not found.
+ */
+int
+find_index_prefix_for_pathkey(List *index_pathkeys,
+ List *pathkeys_positions,
+ PathKey *target_pathkey)
+{
+ ListCell *lc;
+ int i;
+
+ i = 0;
+ foreach(lc, index_pathkeys)
+ {
+ PathKey *cpathkey = (PathKey *) lfirst(lc);
+
+ if (cpathkey == target_pathkey)
+ {
+ /*
+ * Prefix expected to start from 1, increment positions since
+ * they're 0 based.
+ */
+ if (pathkeys_positions != NIL &&
+ pathkeys_positions->length > i)
+ return list_nth_int(pathkeys_positions, i) + 1;
+ else
+ return i + 1;
+ }
+
+ i++;
+ }
+
+ return 0;
+}
+
/*
* build_index_pathkeys
* Build a pathkeys list that describes the ordering induced by an index
@@ -535,7 +576,9 @@ get_cheapest_parallel_safe_total_inner(List *paths)
* We iterate only key columns of covering indexes, since non-key columns
* don't influence index ordering. The result is canonical, meaning that
* redundant pathkeys are removed; it may therefore have fewer entries than
- * there are key columns in the index.
+ * there are key columns in the index. Since removing redundant pathkeys loses
+ * the information about each key's original position, that information is
+ * returned via the positions argument.
*
* Another reason for stopping early is that we may be able to tell that
* an index column's sort order is uninteresting for this query. However,
@@ -546,7 +589,8 @@ get_cheapest_parallel_safe_total_inner(List *paths)
List *
build_index_pathkeys(PlannerInfo *root,
IndexOptInfo *index,
- ScanDirection scandir)
+ ScanDirection scandir,
+ List **positions)
{
List *retval = NIL;
ListCell *lc;
@@ -555,6 +599,8 @@ build_index_pathkeys(PlannerInfo *root,
if (index->sortopfamily == NULL)
return NIL; /* non-orderable index */
+ *positions = NIL;
+
i = 0;
foreach(lc, index->indextlist)
{
@@ -608,7 +654,11 @@ build_index_pathkeys(PlannerInfo *root,
* for this query. Add it to list, unless it's redundant.
*/
if (!pathkey_is_redundant(cpathkey, retval))
+ {
retval = lappend(retval, cpathkey);
+ *positions = lappend_int(*positions,
+ foreach_current_index(lc));
+ }
}
else
{
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fa069a217c..511dda3a9f 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -191,7 +191,8 @@ static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
List *indexqual, List *recheckqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static BitmapIndexScan *make_bitmap_indexscan(Index scanrelid, Oid indexid,
List *indexqual,
List *indexqualorig);
@@ -3108,7 +3109,8 @@ create_indexscan_plan(PlannerInfo *root,
stripped_indexquals,
fixed_indexorderbys,
indexinfo->indextlist,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
else
scan_plan = (Scan *) make_indexscan(tlist,
qpqual,
@@ -5482,7 +5484,8 @@ make_indexonlyscan(List *qptlist,
List *recheckqual,
List *indexorderby,
List *indextlist,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexOnlyScan *node = makeNode(IndexOnlyScan);
Plan *plan = &node->scan.plan;
@@ -5498,6 +5501,7 @@ make_indexonlyscan(List *qptlist,
node->indexorderby = indexorderby;
node->indextlist = indextlist;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index abb77d867e..0c61795389 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3095,6 +3095,43 @@ create_upper_unique_path(PlannerInfo *root,
return pathnode;
}
+/*
+ * create_skipscan_unique_path
+ * Creates a pathnode the same as an existing IndexPath except based on
+ * skipping duplicate values. This may or may not be cheaper than using
+ * create_upper_unique_path.
+ *
+ * The input path must be an IndexPath for an index that supports amskip.
+ */
+IndexPath *
+create_skipscan_unique_path(PlannerInfo *root, IndexOptInfo *index,
+ Path *basepath, int prefix)
+{
+ IndexPath *pathnode = makeNode(IndexPath);
+ int numDistinctRows;
+ List *uniqExprs;
+
+ Assert(IsA(basepath, IndexPath));
+
+ /* We don't want to modify basepath, so make a copy. */
+ memcpy(pathnode, basepath, sizeof(IndexPath));
+
+ uniqExprs = linitial_node(List, root->query_uniquekeys);
+
+ Assert(prefix > 0);
+ pathnode->indexskipprefix = prefix;
+ pathnode->path.uniquekeys = root->query_uniquekeys;
+
+ numDistinctRows = estimate_num_groups(root, uniqExprs,
+ pathnode->path.rows,
+ NULL, NULL);
+
+ pathnode->path.total_cost = pathnode->path.startup_cost * numDistinctRows;
+ pathnode->path.rows = numDistinctRows;
+
+ return pathnode;
+}
+
/*
* create_agg_path
* Creates a pathnode that represents performing aggregation/grouping
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index a5002ad895..1e6fb0c543 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -272,6 +272,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
info->amoptionalkey = amroutine->amoptionalkey;
info->amsearcharray = amroutine->amsearcharray;
info->amsearchnulls = amroutine->amsearchnulls;
+ info->amcanskip = (amroutine->amskip != NULL);
info->amcanparallel = amroutine->amcanparallel;
info->amhasgettuple = (amroutine->amgettuple != NULL);
info->amhasgetbitmap = amroutine->amgetbitmap != NULL &&
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 932aefc777..6410c6ede7 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1010,6 +1010,16 @@ static struct config_bool ConfigureNamesBool[] =
true,
NULL, NULL, NULL
},
+ {
+ {"enable_indexskipscan", PGC_USERSET, QUERY_TUNING_METHOD,
+ gettext_noop("Enables the planner's use of index-skip-scan plans."),
+ NULL,
+ GUC_EXPLAIN
+ },
+ &enable_indexskipscan,
+ true,
+ NULL, NULL, NULL
+ },
{
{"enable_bitmapscan", PGC_USERSET, QUERY_TUNING_METHOD,
gettext_noop("Enables the planner's use of bitmap-scan plans."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 4cf5b26a36..17ff364a7a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -370,6 +370,7 @@
#enable_incremental_sort = on
#enable_indexscan = on
#enable_indexonlyscan = on
+#enable_indexskipscan = on
#enable_material = on
#enable_memoize = on
#enable_mergejoin = on
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index a382551a98..cb2f48a1bc 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -173,6 +173,11 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
typedef bool (*amgettuple_function) (IndexScanDesc scan,
ScanDirection direction);
+/* skip past duplicates in a given prefix */
+typedef bool (*amskip_function) (IndexScanDesc scan,
+ ScanDirection dir,
+ int prefix);
+
/* fetch all valid tuples */
typedef int64 (*amgetbitmap_function) (IndexScanDesc scan,
TIDBitmap *tbm);
@@ -277,6 +282,7 @@ typedef struct IndexAmRoutine
amendscan_function amendscan;
ammarkpos_function ammarkpos; /* can be NULL */
amrestrpos_function amrestrpos; /* can be NULL */
+ amskip_function amskip; /* can be NULL */
/* interface functions to support parallel index scans */
amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 134b20f1e6..d13d95c458 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -183,6 +183,7 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *istat);
extern bool index_can_return(Relation indexRelation, int attno);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/sdir.h b/src/include/access/sdir.h
index 1ab4d5e19a..fd71629da4 100644
--- a/src/include/access/sdir.h
+++ b/src/include/access/sdir.h
@@ -55,4 +55,11 @@ typedef enum ScanDirection
#define ScanDirectionIsForward(direction) \
((bool) ((direction) == ForwardScanDirection))
+/*
+ * ScanDirectionsAreOpposite
+ * True iff scan directions are backward/forward or forward/backward.
+ */
+#define ScanDirectionsAreOpposite(dirA, dirB) \
+ ((bool) ((dirA) != NoMovementScanDirection && (dirA) == -(dirB)))
+
#endif /* SDIR_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 44dd73fc80..cf36e6c0e6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1533,6 +1533,8 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
+ int ioss_SkipPrefixSize;
+ bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 056b13826a..40997ee759 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1243,6 +1243,9 @@ typedef struct Path
* we need not recompute them when considering using the same index in a
* bitmap index/heap scan (see BitmapHeapPath). The costs of the IndexPath
* itself represent the costs of an IndexScan or IndexOnlyScan plan type.
+ *
+ * 'indexskipprefix' represents the number of columns to consider for skip
+ * scans.
*----------
*/
typedef struct IndexPath
@@ -1255,6 +1258,7 @@ typedef struct IndexPath
ScanDirection indexscandir;
Cost indextotalcost;
Selectivity indexselectivity;
+ int indexskipprefix;
} IndexPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 0b518ce6b2..6b3eefebc6 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -453,6 +453,8 @@ typedef struct IndexOnlyScan
List *indexorderby; /* list of index ORDER BY exprs */
List *indextlist; /* TargetEntry list describing index's cols */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexOnlyScan;
/* ----------------
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 356a51f370..03d5816c82 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -50,6 +50,7 @@ extern PGDLLIMPORT int max_parallel_workers_per_gather;
extern PGDLLIMPORT bool enable_seqscan;
extern PGDLLIMPORT bool enable_indexscan;
extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_indexskipscan;
extern PGDLLIMPORT bool enable_bitmapscan;
extern PGDLLIMPORT bool enable_tidscan;
extern PGDLLIMPORT bool enable_sort;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index bb6d730e93..227cda4bd7 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -218,6 +218,10 @@ extern UpperUniquePath *create_upper_unique_path(PlannerInfo *root,
Path *subpath,
int numCols,
double numGroups);
+extern IndexPath *create_skipscan_unique_path(PlannerInfo *root,
+ IndexOptInfo *index,
+ Path *subpath,
+ int prefix);
extern AggPath *create_agg_path(PlannerInfo *root,
RelOptInfo *rel,
Path *subpath,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 3dfa21adad..72b3bd9059 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -212,8 +212,11 @@ extern Path *get_cheapest_fractional_path_for_pathkeys(List *paths,
Relids required_outer,
double fraction);
extern Path *get_cheapest_parallel_safe_total_inner(List *paths);
+extern int find_index_prefix_for_pathkey(List *index_pathkeys,
+ List *pathkey_positions,
+ PathKey *target_pathkey);
extern List *build_index_pathkeys(PlannerInfo *root, IndexOptInfo *index,
- ScanDirection scandir);
+ ScanDirection scandir, List **positions);
extern List *build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
ScanDirection scandir, bool *partialkeys);
extern List *build_expression_pathkey(PlannerInfo *root, Expr *expr,
diff --git a/src/test/regress/expected/sysviews.out b/src/test/regress/expected/sysviews.out
index 442eeb1e3f..1ada679d46 100644
--- a/src/test/regress/expected/sysviews.out
+++ b/src/test/regress/expected/sysviews.out
@@ -110,6 +110,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_incremental_sort | on
enable_indexonlyscan | on
enable_indexscan | on
+ enable_indexskipscan | on
enable_material | on
enable_memoize | on
enable_mergejoin | on
@@ -122,7 +123,7 @@ select name, setting from pg_settings where name like 'enable%';
enable_seqscan | on
enable_sort | on
enable_tidscan | on
-(20 rows)
+(21 rows)
-- Test that the pg_timezone_names and pg_timezone_abbrevs views are
-- more-or-less working. We can't test their contents in any great detail
--
2.32.0
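For reference, the planner and executor support above can be exercised with a
query along the lines of the regression tests added by the next patch. A
minimal sketch, assuming the whole v41 series is applied and
enable_indexskipscan is left at its default of on:

    CREATE TABLE distinct_a (a int, b int, c int);
    INSERT INTO distinct_a
        SELECT five, tenthous, 10
        FROM generate_series(1, 5) five, generate_series(1, 10000) tenthous;
    CREATE INDEX ON distinct_a (a, b);
    ANALYZE distinct_a;

    -- expected to show an Index Only Scan with "Skip scan: true"
    -- and "Distinct Prefix: 1" in the plan
    EXPLAIN (COSTS OFF) SELECT DISTINCT a FROM distinct_a;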
Attachment: v41-0003-amskip-implementation-for-Btree.patch (text/x-diff; charset=us-ascii)
From 0fef4ca321421b442e964ee4ae9b5cb721452750 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Thu, 20 May 2021 21:13:38 +0200
Subject: [PATCH v41 3/6] amskip implementation for Btree
Btree implementation of the index AM method amskip for Index Skip Scan. To
make it robust and suitable for both situations:
* a small number of distinct values (e.g. because of planner
underestimation)
* a significant number of distinct values
a mixed approach is implemented. Instead of restarting the search for
every value, we first check whether there is a next distinct value on the
current page; only if no such value is found do we restart the search
from the tree root.
No support for backward scan is implemented in the case of a scroll
cursor; a Material node is put on top instead.
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
src/backend/access/nbtree/nbtree.c | 12 +
src/backend/access/nbtree/nbtsearch.c | 215 +++++-
src/include/access/nbtree.h | 5 +
src/test/regress/expected/join.out | 3 +
src/test/regress/expected/select_distinct.out | 642 ++++++++++++++++++
src/test/regress/sql/join.sql | 5 +
src/test/regress/sql/select_distinct.sql | 282 ++++++++
7 files changed, 1163 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c9b4964c1e..7b2aa594fb 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -125,6 +125,7 @@ bthandler(PG_FUNCTION_ARGS)
amroutine->ambulkdelete = btbulkdelete;
amroutine->amvacuumcleanup = btvacuumcleanup;
amroutine->amcanreturn = btcanreturn;
+ amroutine->amskip = btskip;
amroutine->amcostestimate = btcostestimate;
amroutine->amoptions = btoptions;
amroutine->amproperty = btproperty;
@@ -376,6 +377,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
*/
so->currTuples = so->markTuples = NULL;
+ so->skipScanKey = NULL;
+
scan->xs_itupdesc = RelationGetDescr(rel);
scan->opaque = so;
@@ -442,6 +445,15 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
_bt_preprocess_array_keys(scan);
}
+/*
+ * btskip() -- skip to the beginning of the next key prefix
+ */
+bool
+btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+{
+ return _bt_skip(scan, direction, prefix);
+}
+
/*
* btendscan() -- close down a scan
*/
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 9d82d4904d..0ad761916a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -45,7 +45,11 @@ static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
+static inline void _bt_update_skip_scankeys(IndexScanDesc scan,
+ Relation indexRel);
+static inline bool _bt_scankey_within_page(IndexScanDesc scan,
+ BTScanInsert key,
+ Buffer buf);
/*
* _bt_drop_lock_and_maybe_pin()
@@ -1498,6 +1502,161 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
return true;
}
+/*
+ * _bt_skip() -- Skip items that have the same prefix as the most recently
+ * fetched index tuple.
+ *
+ * The current position is set so that a subsequent call to _bt_next will
+ * fetch the first tuple that differs in the leading 'prefix' keys.
+ *
+ * The current page is searched for the next unique value. If none is found
+ * we will do a scan from the root in order to find the next page with
+ * a unique value.
+ */
+bool
+_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ BTStack stack;
+ Buffer buf;
+ OffsetNumber offnum;
+ BTScanPosItem *currItem;
+ Relation indexRel = scan->indexRelation;
+ bool scanstart = !BTScanPosIsValid(so->currPos);
+
+ /* We want to return tuples, and we need a starting point */
+ Assert(scan->xs_want_itup);
+ Assert(scan->xs_itup);
+
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ /* If skipScanKey is NULL then we initialize it with _bt_mkscankey */
+ if (so->skipScanKey == NULL)
+ {
+ so->skipScanKey = _bt_mkscankey(indexRel, scan->xs_itup);
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->scantid = NULL;
+ }
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* Check if the next unique key can be found within the current page.
+ * Since we do not lock the current page between jumps, it's possible
+ * that it was split since the last time we saw it. This is fine when
+ * scanning forward, since pages split to the right and we are still on
+ * the leftmost relevant page. When scanning backwards it's possible to
+ * lose some pages, and we would need to remember the previous page and
+ * then follow the right links from the current page until we find the
+ * original one.
+ *
+ * Since the whole point of checking the current page is to protect
+ * ourselves and to speed up the statistics-mismatch case, when there
+ * are too many distinct values for jumping to pay off, it's not clear
+ * that the complexity of such a solution for backward scans is
+ * justified, so for now just avoid it.
+ */
+ if (BufferIsValid(so->currPos.buf) && ScanDirectionIsForward(dir))
+ {
+ _bt_lockbuf(indexRel, so->currPos.buf, BT_READ);
+
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf))
+ {
+ bool keyFound = false;
+
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, so->currPos.buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(so->currPos.buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /* Now read the data */
+ keyFound = _bt_readpage(scan, dir, offnum);
+
+ _bt_relbuf(indexRel, so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ if (keyFound)
+ {
+ /* set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+ return true;
+ }
+ }
+ else
+ _bt_unlockbuf(indexRel, so->currPos.buf);
+ }
+
+ if (BufferIsValid(so->currPos.buf))
+ {
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+ }
+
+ /*
+ * We haven't found scan key within the current page, so let's scan from
+ * the root. Use _bt_search and _bt_binsrch to get the buffer and offset
+ * number
+ */
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+
+ /* Lock the page for SERIALIZABLE transactions */
+ PredicateLockPage(scan->indexRelation, BufferGetBlockNumber(buf),
+ scan->xs_snapshot);
+
+ /* We know in which direction to look */
+ _bt_initialize_more_data(so, dir);
+
+ /*
+ * Simplest case is when both directions are forward, when we are already
+ * at the next distinct key at the beginning of the series (so everything
+ * else would be done in _bt_readpage)
+ *
+ * The case when both directions are backwards is also simple, but we need
+ * to go one step back, since we need a last element from the previous
+ * series.
+ */
+ if (ScanDirectionIsBackward(dir) || (ScanDirectionIsForward(dir) & scanstart))
+ offnum = OffsetNumberPrev(offnum);
+
+ /* Now read the data */
+ if (!_bt_readpage(scan, dir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to advance to
+ * the next page. Return false if there's no matching data at all.
+ */
+ _bt_unlockbuf(indexRel, so->currPos.buf);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* And set IndexTuple */
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ so->currPos.moreLeft = true;
+ so->currPos.moreRight = true;
+
+ return true;
+}
+
/*
* _bt_readpage() -- Load data from current index page into so->currPos
*
@@ -2494,3 +2653,57 @@ _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
so->numKilled = 0; /* just paranoia */
so->markItemIndex = -1; /* ditto */
}
+
+/*
+ * _bt_update_skip_scankeys() -- set up new values for the existing scankeys
+ * based on the current index tuple
+ */
+static inline void
+_bt_update_skip_scankeys(IndexScanDesc scan, Relation indexRel)
+{
+ TupleDesc itupdesc;
+ int indnkeyatts,
+ i;
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ScanKey scankeys = so->skipScanKey->scankeys;
+
+ itupdesc = RelationGetDescr(indexRel);
+ indnkeyatts = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ for (i = 0; i < indnkeyatts; i++)
+ {
+ Datum datum;
+ bool null;
+ int flags;
+
+ datum = index_getattr(scan->xs_itup, i + 1, itupdesc, &null);
+ flags = (null ? SK_ISNULL : 0) |
+ (indexRel->rd_indoption[i] << SK_BT_INDOPTION_SHIFT);
+ scankeys[i].sk_flags = flags;
+ scankeys[i].sk_argument = datum;
+ }
+}
+
+/*
+ * _bt_scankey_within_page() -- check if the provided scankey could be found
+ * within a page, specified by the buffer.
+ *
+ * Scankey nextkey will tell us if we need to find a current key or the next
+ * one, which affects whether or not it's ok to be equal to the page highkey.
+ */
+static inline bool
+_bt_scankey_within_page(IndexScanDesc scan, BTScanInsert key, Buffer buf)
+{
+ OffsetNumber low, high;
+ Page page = BufferGetPage(buf);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+ int high_compare = key->nextkey ? 0 : 1;
+
+ low = P_FIRSTDATAKEY(opaque);
+ high = PageGetMaxOffsetNumber(page);
+
+ if (unlikely(high < low))
+ return false;
+
+ return (_bt_compare(scan->indexRelation, key, page, low) > 0 &&
+ _bt_compare(scan->indexRelation, key, page, high) < high_compare);
+}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9fec6fb1a8..2c516654c2 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1064,6 +1064,9 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* Work space for _bt_skip */
+ BTScanInsert skipScanKey; /* used to control skipping */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -1229,6 +1232,7 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -1253,6 +1257,7 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
diff --git a/src/test/regress/expected/join.out b/src/test/regress/expected/join.out
index 19caebabd0..ebad6d4ae1 100644
--- a/src/test/regress/expected/join.out
+++ b/src/test/regress/expected/join.out
@@ -4614,6 +4614,8 @@ select d.* from d left join (select * from b group by b.id, b.c_id) s
-> Seq Scan on d
(8 rows)
+-- disable index skip scan to prevent it interfering with the plan
+set enable_indexskipscan to off;
-- similarly, but keying off a DISTINCT clause
explain (costs off)
select d.* from d left join (select distinct * from b) s
@@ -4631,6 +4633,7 @@ select d.* from d left join (select distinct * from b) s
-> Seq Scan on d
(9 rows)
+set enable_indexskipscan to on;
-- check join removal works when uniqueness of the join condition is enforced
-- by a UNION
explain (costs off)
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 748419cee0..36b3291a7f 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -375,3 +375,645 @@ SELECT null IS NOT DISTINCT FROM null as "yes";
t
(1 row)
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+CREATE INDEX ON distinct_a ((a + 1));
+ANALYZE distinct_a;
+SELECT DISTINCT a FROM distinct_a;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+ a
+---
+ 1
+(1 row)
+
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+ QUERY PLAN
+--------------------------------------------------------
+ Index Only Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+ Distinct Prefix: 1
+(3 rows)
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+-- test index skip scan for expressions
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT (a + 1) FROM distinct_a ORDER BY (a + 1);
+ QUERY PLAN
+------------------------------------
+ Sort
+ Sort Key: ((a + 1))
+ -> HashAggregate
+ Group Key: (a + 1)
+ -> Seq Scan on distinct_a
+(5 rows)
+
+SELECT DISTINCT (a + 1) FROM distinct_a ORDER BY (a + 1);
+ ?column?
+----------
+ 2
+ 3
+ 4
+ 5
+ 6
+(5 rows)
+
+-- check column order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ a | b
+---+---
+ 1 | 2
+ 2 | 2
+ 3 | 2
+ 4 | 2
+ 5 | 2
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: true
+ Distinct Prefix: 2
+ Index Cond: (b = 2)
+(4 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+ QUERY PLAN
+----------------------------------------------------
+ Index Only Scan using distinct_a_b_a on distinct_a
+ Skip scan: true
+ Distinct Prefix: 2
+ Index Cond: (b = 2)
+(4 rows)
+
+DROP INDEX distinct_a_b_a;
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+FETCH FROM c;
+ a | b
+---+---
+ 1 | 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+---
+ 5 | 1
+ 4 | 1
+ 3 | 1
+ 2 | 1
+ 1 | 1
+(5 rows)
+
+END;
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+FETCH FROM c;
+ a | b
+---+-------
+ 5 | 10000
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a | b
+---+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+FETCH 6 FROM c;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a | b
+---+-------
+ 1 | 10000
+ 2 | 10000
+ 3 | 10000
+ 4 | 10000
+ 5 | 10000
+(5 rows)
+
+END;
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+ QUERY PLAN
+--------------------------------------------------------------
+ Index Only Scan using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: true
+ Distinct Prefix: 1
+ Index Cond: (c = 2)
+(4 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 1 | 2
+ 3 | 1 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 1 | 2
+ 1 | 1 | 2
+(2 rows)
+
+END;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+ QUERY PLAN
+-----------------------------------------------------------------------
+ Index Only Scan Backward using distinct_abc_a_b_c_idx on distinct_abc
+ Skip scan: true
+ Distinct Prefix: 1
+ Index Cond: (c = 2)
+(4 rows)
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+END;
+DROP TABLE distinct_abc;
+-- check column order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+ QUERY PLAN
+---------------------------------------------------------
+ Unique
+ -> Index Scan using distinct_a_a_b_idx on distinct_a
+ Index Cond: (b = 2)
+ Filter: (c = 10)
+(4 rows)
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+ a | a
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+(5 rows)
+
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+ a | ?column?
+---+----------
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+FETCH FROM c;
+ a
+---
+ 1
+(1 row)
+
+FETCH BACKWARD FROM c;
+ a
+---
+(0 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+FETCH 6 FROM c;
+ a
+---
+ 1
+ 2
+ 3
+ 4
+ 5
+(5 rows)
+
+FETCH BACKWARD 6 FROM c;
+ a
+---
+ 5
+ 4
+ 3
+ 2
+ 1
+(5 rows)
+
+END;
+DROP TABLE distinct_a;
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 1
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+ a | b
+---+---
+ 1 | 1
+ 2 | 2
+ 3 | 1
+ 4 | 1
+ 5 | 1
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 10000
+ 1 | 10000
+(5 rows)
+
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+ a | b
+---+-------
+ 5 | 10000
+ 4 | 10000
+ 3 | 10000
+ 2 | 9999
+ 1 | 10000
+(5 rows)
+
+DROP TABLE distinct_visibility;
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ QUERY PLAN
+----------------------------------------------------------------------------
+ Index Only Scan using distinct_boundaries_a_b_c_idx on distinct_boundaries
+ Skip scan: true
+ Distinct Prefix: 1
+ Index Cond: ((b >= 1) AND (c = 0))
+(4 rows)
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+ a | b | c
+---+---+---
+ 1 | 2 | 0
+ 2 | 2 | 0
+ 3 | 2 | 0
+ 4 | 2 | 0
+ 5 | 2 | 0
+(5 rows)
+
+DROP TABLE distinct_boundaries;
+-- test tuple killing
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(4 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- regular ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed where a = 3;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a, b;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+---+---+----
+ 1 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 5 | 1 | 1 | 10
+(4 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+---+---+----
+ 5 | 1 | 1 | 10
+ 4 | 1 | 1 | 10
+ 2 | 1 | 1 | 10
+ 1 | 1 | 1 | 10
+(4 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- partial delete
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+CREATE INDEX ON distinct_killed (a, b, c, d);
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 5 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 1 | 1000 | 0 | 10
+(5 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c | d
+---+------+---+----
+ 1 | 1000 | 0 | 10
+ 2 | 1000 | 0 | 10
+ 3 | 1000 | 0 | 10
+ 4 | 1000 | 0 | 10
+ 5 | 1000 | 0 | 10
+(5 rows)
+
+COMMIT;
+DROP TABLE distinct_killed;
+-- test posting lists
+CREATE TABLE distinct_posting (a int, b int, c int);
+CREATE INDEX ON distinct_posting (a, b, c);
+INSERT INTO distinct_posting
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+INSERT INTO distinct_posting (
+ SELECT 1 as a, 1 as b, 1 AS c
+ FROM generate_series(1,1000) i
+);
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c FROM distinct_posting WHERE c = 2
+ ORDER BY a DESC, b DESC;
+ FETCH ALL FROM c;
+ a | b | c
+---+---+---
+ 3 | 2 | 2
+ 1 | 2 | 2
+(2 rows)
+
+ FETCH BACKWARD ALL FROM c;
+ a | b | c
+---+---+---
+ 1 | 2 | 2
+ 3 | 2 | 2
+(2 rows)
+
+COMMIT;
+-- test that quals are checked for indexability before being applied
+CREATE TABLE Indexable_quals (a text, b text, c text);
+CREATE INDEX ON indexable_quals (a, b, c);
+INSERT INTO indexable_quals VALUES ('a1', 'b', 'xxx');
+INSERT INTO indexable_quals VALUES ('a1', 'b', 'yyy');
+INSERT INTO indexable_quals VALUES ('a2', 'b', 'xxx');
+INSERT INTO indexable_quals VALUES ('a2', 'b', 'yyy');
+SELECT DISTINCT ON (a, b) a, b
+FROM indexable_quals WHERE c LIKE '%y%' AND a LIKE 'a%' AND b = 'b';
+ a | b
+----+---
+ a1 | b
+ a2 | b
+(2 rows)
+
diff --git a/src/test/regress/sql/join.sql b/src/test/regress/sql/join.sql
index 6dd01b022e..972d58fd54 100644
--- a/src/test/regress/sql/join.sql
+++ b/src/test/regress/sql/join.sql
@@ -1585,11 +1585,16 @@ explain (costs off)
select d.* from d left join (select * from b group by b.id, b.c_id) s
on d.a = s.id;
+-- disable index skip scan to prevent it interfering with the plan
+set enable_indexskipscan to off;
+
-- similarly, but keying off a DISTINCT clause
explain (costs off)
select d.* from d left join (select distinct * from b) s
on d.a = s.id;
+set enable_indexskipscan to on;
+
-- check join removal works when uniqueness of the join condition is enforced
-- by a UNION
explain (costs off)
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index f27ff714f8..c9ccf4cc7d 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -174,3 +174,285 @@ SELECT 1 IS NOT DISTINCT FROM 2 as "no";
SELECT 2 IS NOT DISTINCT FROM 2 as "yes";
SELECT 2 IS NOT DISTINCT FROM null as "no";
SELECT null IS NOT DISTINCT FROM null as "yes";
+
+-- index only skip scan
+CREATE TABLE distinct_a (a int, b int, c int);
+INSERT INTO distinct_a (
+ SELECT five, tenthous, 10 FROM
+ generate_series(1, 5) five,
+ generate_series(1, 10000) tenthous
+);
+CREATE INDEX ON distinct_a (a, b);
+CREATE INDEX ON distinct_a ((a + 1));
+ANALYZE distinct_a;
+
+SELECT DISTINCT a FROM distinct_a;
+SELECT DISTINCT a FROM distinct_a WHERE a = 1;
+SELECT DISTINCT a FROM distinct_a ORDER BY a DESC;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a;
+
+-- test index skip scan with a condition on a non unique field
+SELECT DISTINCT ON (a) a, b FROM distinct_a WHERE b = 2;
+
+-- test index skip scan backwards
+SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
+
+-- test index skip scan for expressions
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT (a + 1) FROM distinct_a ORDER BY (a + 1);
+SELECT DISTINCT (a + 1) FROM distinct_a ORDER BY (a + 1);
+
+-- check column order
+CREATE INDEX distinct_a_b_a on distinct_a (b, a);
+
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT on (a, b) a, b FROM distinct_a WHERE b = 2;
+
+DROP INDEX distinct_a_b_a;
+
+-- test opposite scan/index directions inside a cursor
+-- forward/backward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a, b;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- backward/forward
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b FROM distinct_a ORDER BY a DESC, b DESC;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+-- test missing values and skipping from the end
+CREATE TABLE distinct_abc(a int, b int, c int);
+CREATE INDEX ON distinct_abc(a, b, c);
+INSERT INTO distinct_abc
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+BEGIN;
+DECLARE c SCROLL CURSOR FOR
+SELECT DISTINCT ON (a) a,b,c FROM distinct_abc WHERE c = 2
+ORDER BY a DESC, b DESC;
+
+FETCH ALL FROM c;
+FETCH BACKWARD ALL FROM c;
+
+END;
+
+DROP TABLE distinct_abc;
+
+-- check column order
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
+
+-- check projection case
+SELECT DISTINCT a, a FROM distinct_a WHERE b = 2;
+SELECT DISTINCT a, 1 FROM distinct_a WHERE b = 2;
+
+-- test cursor forward/backward movements
+BEGIN;
+DECLARE c SCROLL CURSOR FOR SELECT DISTINCT a FROM distinct_a;
+
+FETCH FROM c;
+FETCH BACKWARD FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+FETCH 6 FROM c;
+FETCH BACKWARD 6 FROM c;
+
+END;
+
+DROP TABLE distinct_a;
+
+-- test tuples visibility
+CREATE TABLE distinct_visibility (a int, b int);
+INSERT INTO distinct_visibility (select a, b from generate_series(1,5) a, generate_series(1, 10000) b);
+CREATE INDEX ON distinct_visibility (a, b);
+ANALYZE distinct_visibility;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 1;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DELETE FROM distinct_visibility WHERE a = 2 and b = 10000;
+SELECT DISTINCT ON (a) a, b FROM distinct_visibility ORDER BY a DESC, b DESC;
+DROP TABLE distinct_visibility;
+
+-- test page boundaries
+CREATE TABLE distinct_boundaries AS
+ SELECT a, b::int2 b, (b % 2)::int2 c FROM
+ generate_series(1, 5) a,
+ generate_series(1,366) b;
+
+CREATE INDEX ON distinct_boundaries (a, b, c);
+ANALYZE distinct_boundaries;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+SELECT DISTINCT ON (a) a, b, c from distinct_boundaries
+WHERE b >= 1 and c = 0 ORDER BY a, b;
+
+DROP TABLE distinct_boundaries;
+
+-- test tuple killing
+
+-- DESC ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- regular ordering
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed where a = 3;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a, b;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- partial delete
+CREATE TABLE distinct_killed AS
+ SELECT a, b, b % 2 AS c, 10 AS d
+ FROM generate_series(1, 5) a,
+ generate_series(1,1000) b;
+
+CREATE INDEX ON distinct_killed (a, b, c, d);
+
+DELETE FROM distinct_killed WHERE a = 3 AND b <= 999;
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c,d
+ FROM distinct_killed ORDER BY a DESC, b DESC;
+ FETCH FORWARD ALL FROM c;
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+DROP TABLE distinct_killed;
+
+-- test posting lists
+CREATE TABLE distinct_posting (a int, b int, c int);
+CREATE INDEX ON distinct_posting (a, b, c);
+INSERT INTO distinct_posting
+ VALUES (1, 1, 1),
+ (1, 1, 2),
+ (1, 2, 2),
+ (1, 2, 3),
+ (2, 2, 1),
+ (2, 2, 3),
+ (3, 1, 1),
+ (3, 1, 2),
+ (3, 2, 2),
+ (3, 2, 3);
+
+INSERT INTO distinct_posting (
+ SELECT 1 as a, 1 as b, 1 AS c
+ FROM generate_series(1,1000) i
+);
+
+BEGIN;
+ DECLARE c SCROLL CURSOR FOR
+ SELECT DISTINCT ON (a) a,b,c FROM distinct_posting WHERE c = 2
+ ORDER BY a DESC, b DESC;
+ FETCH ALL FROM c;
+
+ FETCH BACKWARD ALL FROM c;
+COMMIT;
+
+-- test that quals are checked for indexability before being applied
+CREATE TABLE Indexable_quals (a text, b text, c text);
+CREATE INDEX ON indexable_quals (a, b, c);
+
+INSERT INTO indexable_quals VALUES ('a1', 'b', 'xxx');
+INSERT INTO indexable_quals VALUES ('a1', 'b', 'yyy');
+INSERT INTO indexable_quals VALUES ('a2', 'b', 'xxx');
+INSERT INTO indexable_quals VALUES ('a2', 'b', 'yyy');
+
+SELECT DISTINCT ON (a, b) a, b
+FROM indexable_quals WHERE c LIKE '%y%' AND a LIKE 'a%' AND b = 'b';
--
2.32.0
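Up to this point a scroll cursor over a skip scan still relies on a Material
node for backward fetches (see the ExecSupportsBackwardScan change above); the
next patch teaches the btree amskip implementation to scan backwards itself.
A sketch of the case in question, reusing the distinct_a table from the
regression tests above:

    BEGIN;
    DECLARE c SCROLL CURSOR FOR
        SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a, b;
    FETCH 6 FROM c;           -- one row per distinct value of a
    FETCH BACKWARD 6 FROM c;  -- must return the same rows in reverse order
    END;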
Attachment: v41-0004-Extend-amskip-implementation-for-Btree.patch (text/x-diff; charset=us-ascii)
From eab94f627c1cbddcc8dec2f1553f28bf3007dab5 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Thu, 20 May 2021 21:17:51 +0200
Subject: [PATCH v41 4/6] Extend amskip implementation for Btree
Add support for backward scans to the Btree amskip implementation. This
makes index skip scan work without a Material node in the case of a
scrolling cursor.
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
src/backend/access/index/indexam.c | 6 +-
src/backend/access/nbtree/nbtree.c | 5 +-
src/backend/access/nbtree/nbtsearch.c | 302 ++++++++++++++++++++++-
src/backend/executor/execAmi.c | 32 +--
src/backend/executor/nodeIndexonlyscan.c | 57 ++++-
src/include/access/amapi.h | 1 +
src/include/access/genam.h | 3 +-
src/include/access/nbtree.h | 6 +-
8 files changed, 371 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index bcf7c73467..9fa5db27ea 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -748,11 +748,13 @@ index_can_return(Relation indexRelation, int attno)
* ----------------
*/
bool
-index_skip(IndexScanDesc scan, ScanDirection direction, int prefix)
+index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool scanstart, int prefix)
{
SCAN_CHECKS;
- return scan->indexRelation->rd_indam->amskip(scan, direction, prefix);
+ return scan->indexRelation->rd_indam->amskip(scan, direction,
+ indexdir, prefix);
}
/* ----------------
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 7b2aa594fb..6a627de65a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -449,9 +449,10 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
* btskip() -- skip to the beginning of the next key prefix
*/
bool
-btskip(IndexScanDesc scan, ScanDirection direction, int prefix)
+btskip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, int prefix)
{
- return _bt_skip(scan, direction, prefix);
+ return _bt_skip(scan, direction, indexdir, prefix);
}
/*
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0ad761916a..00742c7e21 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -1509,12 +1509,31 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
* The current position is set so that a subsequent call to _bt_next will
* fetch the first tuple that differs in the leading 'prefix' keys.
*
- * The current page is searched for the next unique value. If none is found
- * we will do a scan from the root in order to find the next page with
- * a unique value.
+ * There are four different kinds of skipping (depending on dir and
+ * indexdir) that are important to distinguish, especially in the presence
+ * of an index condition:
+ *
+ * * Advancing forward and reading forward
+ * simple scan
+ *
+ * * Advancing forward and reading backward
+ * scan inside a cursor fetching backward, when skipping is necessary
+ * right from the start
+ *
+ * * Advancing backward and reading forward
+ * scan with order by desc inside a cursor fetching forward, when
+ * skipping is necessary right from the start
+ *
+ * * Advancing backward and reading backward
+ * simple scan with order by desc
+ *
+ * The current page is searched for the next unique value. If none is found
+ * we will do a scan from the root in order to find the next page with
+ * a unique value.
*/
bool
-_bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
+_bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, int prefix)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTStack stack;
@@ -1625,11 +1644,282 @@ _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix)
* to go one step back, since we need a last element from the previous
* series.
*/
- if (ScanDirectionIsBackward(dir) || (ScanDirectionIsForward(dir) & scanstart))
+ if ((ScanDirectionIsBackward(dir) && ScanDirectionIsBackward(indexdir)) ||
+ (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir) & scanstart))
offnum = OffsetNumberPrev(offnum);
+ /*
+ * Advance backward but read forward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can read forward without doing anything else. Otherwise
+ * find the previous distinct key and the beginning of its series and read
+ * forward from there. To do so, go back one step, perform a binary search
+ * to find the first item in the series and let _bt_readpage do everything
+ * else.
+ */
+ else if (ScanDirectionIsBackward(dir) && ScanDirectionIsForward(indexdir) && !scanstart)
+ {
+ /* Reading forward means we expect to see more data on the right */
+ so->currPos.moreRight = true;
+
+ /* One step back to find a previous value */
+ if (!_bt_readpage(scan, dir, offnum) ||
+ --so->currPos.itemIndex < so->currPos.firstItem)
+ {
+ _bt_unlockbuf(indexRel, so->currPos.buf);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ scan->xs_heaptid = currItem->heapTid;
+ if (scan->xs_want_itup)
+ scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /*
+ * And now find the last item of the sequence for the current
+ * value, with the intention to do OffsetNumberNext. As a
+ * result we end up on the first element of the sequence.
+ */
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf))
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ else
+ {
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+ offnum = _bt_binsrch(scan->indexRelation, so->skipScanKey, buf);
+ }
+ }
+
+ /*
+ * Advance forward but read backward. At this moment we are at the next
+ * distinct key at the beginning of the series. If the scan has just
+ * started, we can go one step back and read forward without doing
+ * anything else. Otherwise find the next distinct key and the beginning
+ * of its series, go one step back and read backward from there.
+ *
+ * An interesting situation can happen if one of the distinct keys does not
+ * pass the corresponding index condition at all. In this case reading
+ * backward can lead to a previous distinct key being found, creating a
+ * loop. To avoid that, check the value to be returned, and jump one more
+ * time if it's the same as at the beginning. Note that we do not check
+ * visibility here, and dead tuples could also lead to the same situation.
+ * This has to be checked on the caller side.
+ */
+ else if (ScanDirectionIsForward(dir) && ScanDirectionIsBackward(indexdir) && !scanstart)
+ {
+ IndexTuple startItup = CopyIndexTuple(scan->xs_itup);
+ bool nextFound = false;
+
+ /* Reading backwards means we expect to see more data on the left */
+ so->currPos.moreLeft = true;
+
+ for (;;)
+ {
+ IndexTuple itup;
+ OffsetNumber jumpOffset;
+
+ if (nextFound)
+ break;
+
+ /*
+ * Find the next index tuple to update the scan key. It could be
+ * at the end, so check for the max offset.
+ */
+ if (!_bt_readpage(scan, ForwardScanDirection, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to
+ * advance to the next page. Return false if there's no
+ * matching data at all.
+ */
+ _bt_unlockbuf(indexRel, so->currPos.buf);
+ if (!_bt_steppage(scan, dir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+ /* check for interrupts while we're not holding any buffer lock */
+ CHECK_FOR_INTERRUPTS();
+
+ currItem = &so->currPos.items[so->currPos.firstItem];
+ itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ scan->xs_itup = itup;
+
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ /* Before leaving current page, deal with any killed items */
+ if (so->numKilled > 0)
+ _bt_killitems(scan);
+
+ ReleaseBuffer(so->currPos.buf);
+ so->currPos.buf = InvalidBuffer;
+
+ stack = _bt_search(scan->indexRelation, so->skipScanKey,
+ &buf, BT_READ, scan->xs_snapshot);
+ _bt_freestack(stack);
+ so->currPos.buf = buf;
+
+ /*
+ * We need to remember the original offset after the jump,
+ * since in case of looping this would be the next starting
+ * point
+ */
+ jumpOffset = offnum = _bt_binsrch(scan->indexRelation,
+ so->skipScanKey, buf);
+ offnum = OffsetNumberPrev(offnum);
+
+ if (!_bt_readpage(scan, indexdir, offnum))
+ {
+ /*
+ * There's no actually-matching data on this page. Try to
+ * advance to the next page. Return false if there's no
+ * matching data at all.
+ */
+ _bt_unlockbuf(indexRel, so->currPos.buf);
+ if (!_bt_steppage(scan, indexdir))
+ {
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ _bt_lockbuf(indexRel, so->currPos.buf, BT_READ);
+ }
+
+ currItem = &so->currPos.items[so->currPos.lastItem];
+ itup = CopyIndexTuple((IndexTuple)
+ (so->currTuples + currItem->tupleOffset));
+
+ /*
+ * To check whether we returned the same tuple, try to find
+ * startItup on the current page. For that we need to update the
+ * scankey to match the whole tuple and set nextkey to return
+ * an exact tuple, not the next one. If the tuple we found in
+ * this way is equal to what we wanted to return, it means we
+ * are in a loop; reset offnum to the original position and
+ * jump further.
+ *
+ * Note that to compare tids we need to keep the leaf pinned,
+ * otherwise there is a danger of vacuum cleaning up relevant
+ * tuples.
+ */
+ scan->xs_itup = startItup;
+ _bt_update_skip_scankeys(scan, indexRel);
+
+ so->skipScanKey->keysz = IndexRelationGetNumberOfKeyAttributes(indexRel);
+ so->skipScanKey->nextkey = false;
+
+ if (_bt_scankey_within_page(scan, so->skipScanKey, so->currPos.buf))
+ {
+ OffsetNumber maxoff, startOffset;
+ IndexTuple verifiedItup;
+ Page page = BufferGetPage(so->currPos.buf);
+ startOffset = _bt_binsrch(scan->indexRelation,
+ so->skipScanKey,
+ so->currPos.buf);
+
+ maxoff = PageGetMaxOffsetNumber(page);
+
+ /* Now read the data */
+ if (_bt_readpage(scan, ForwardScanDirection, startOffset))
+ {
+ ItemPointer resultTids, verifyTids;
+ int nresult = 1,
+ nverify = 1;
+
+ currItem = &so->currPos.items[so->currPos.itemIndex];
+ verifiedItup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
+
+ /*
+ * We need to keep in mind that tuples we deal with
+ * could also be posting tuples that represent a list of
+ * TIDs.
+ */
+ if (BTreeTupleIsPosting(verifiedItup))
+ {
+ nverify = BTreeTupleGetNPosting(verifiedItup);
+ verifyTids = BTreeTupleGetPosting(verifiedItup);
+ for (int i = 1; i < nverify; i++)
+ verifyTids[i] = *BTreeTupleGetPostingN(verifiedItup, i);
+ }
+ else
+ verifyTids = &verifiedItup->t_tid;
+
+ if (BTreeTupleIsPosting(itup))
+ {
+ nresult = BTreeTupleGetNPosting(itup);
+ resultTids = BTreeTupleGetPosting(itup);
+ for (int i = 1; i < nresult; i++)
+ resultTids[i] = *BTreeTupleGetPostingN(itup, i);
+ }
+ else
+ resultTids = &itup->t_tid;
+
+ /* If any pair of TIDs differs, the tuples are not the same. */
+ for(int i = 0; i < nverify; i++)
+ {
+ for(int j = 0; j < nresult; j++)
+ {
+ if (!ItemPointerEquals(&resultTids[j], &verifyTids[i]))
+ {
+ nextFound = true;
+ break;
+ }
+ }
+ }
+
+ if (!nextFound)
+ offnum = jumpOffset;
+ }
+
+ if ((offnum > maxoff) && (so->currPos.nextPage == P_NONE))
+ {
+ _bt_relbuf(indexRel, so->currPos.buf);
+ BTScanPosInvalidate(so->currPos);
+
+ pfree(so->skipScanKey);
+ so->skipScanKey = NULL;
+ return false;
+ }
+ }
+ else
+ /*
+ * If startItup could be not found within the current page,
+ * assume we found something new
+ */
+ nextFound = true;
+
+ /* Return original scankey options */
+ so->skipScanKey->keysz = prefix;
+ so->skipScanKey->nextkey = ScanDirectionIsForward(dir);
+ }
+ }
+
/* Now read the data */
- if (!_bt_readpage(scan, dir, offnum))
+ if (!_bt_readpage(scan, indexdir, offnum))
{
/*
* There's no actually-matching data on this page. Try to advance to
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index ced8933f44..b6245994f0 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -64,7 +64,7 @@
#include "utils/rel.h"
#include "utils/syscache.h"
-static bool IndexSupportsBackwardScan(Plan *node);
+static bool IndexSupportsBackwardScan(Oid indexid);
/*
@@ -555,8 +555,10 @@ ExecSupportsBackwardScan(Plan *node)
return false;
case T_IndexScan:
+ return IndexSupportsBackwardScan(((IndexScan *) node)->indexid);
+
case T_IndexOnlyScan:
- return IndexSupportsBackwardScan(node);
+ return IndexSupportsBackwardScan(((IndexOnlyScan *) node)->indexid);
case T_SubqueryScan:
return ExecSupportsBackwardScan(((SubqueryScan *) node)->subplan);
@@ -596,38 +598,16 @@ ExecSupportsBackwardScan(Plan *node)
/*
* An IndexScan or IndexOnlyScan node supports backward scan only if the
- * index's AM does and no skip scan is used.
+ * index's AM does.
*/
static bool
-IndexSupportsBackwardScan(Plan *node)
+IndexSupportsBackwardScan(Oid indexid)
{
bool result;
- Oid indexid = InvalidOid;
- int skip_prefix_size = 0;
HeapTuple ht_idxrel;
Form_pg_class idxrelrec;
IndexAmRoutine *amroutine;
- Assert(IsA(node, IndexScan) || IsA(node, IndexOnlyScan));
- switch(nodeTag(node))
- {
- case T_IndexScan:
- indexid = ((IndexScan *) node)->indexid;
- break;
-
- case T_IndexOnlyScan:
- indexid = ((IndexOnlyScan *) node)->indexid;
- skip_prefix_size = ((IndexOnlyScan *) node)->indexskipprefixsize;
- break;
-
- default:
- elog(DEBUG2, "unrecognized node type: %d", (int) nodeTag(node));
- break;
- }
-
- if (skip_prefix_size > 0)
- return false;
-
/* Fetch the pg_class tuple of the index relation */
ht_idxrel = SearchSysCache1(RELOID, ObjectIdGetDatum(indexid));
if (!HeapTupleIsValid(ht_idxrel))
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 40ad1b949b..d5abac20cb 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid = NULL;
+ ItemPointerData startTid;
IndexOnlyScan *indexonlyscan = (IndexOnlyScan *) node->ss.ps.plan;
/*
@@ -75,6 +76,14 @@ IndexOnlyNext(IndexOnlyScanState *node)
*/
bool skipped = false;
+ /*
+ * An index-only scan must be aware that in case of skipping we can return
+ * to the starting point due to visibility checks. In this situation we
+ * need to jump further, and the number of skipping attempts tells us how
+ * far we need to go.
+ */
+ int skipAttempts = 0;
+
/*
* extract necessary information from index scan node
*/
@@ -123,13 +132,27 @@ IndexOnlyNext(IndexOnlyScanState *node)
node->ioss_OrderByKeys,
node->ioss_NumOrderByKeys);
}
+ else
+ {
+ ItemPointerCopy(&scandesc->xs_heaptid, &startTid);
+ }
/*
* Check if we need to skip to the next key prefix.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor direction.
+ * Due to that we skip also when the first tuple wasn't emitted yet, but
+ * the directions are opposite.
*/
- if (node->ioss_SkipPrefixSize > 0 && node->ioss_FirstTupleEmitted)
+ if (node->ioss_SkipPrefixSize > 0 &&
+ (node->ioss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexonlyscan->indexorderdir)))
{
- if (!index_skip(scandesc, direction, node->ioss_SkipPrefixSize))
+ if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
{
/*
* Reached end of index. At this point currPos is invalidated, and
@@ -144,6 +167,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
else
{
skipped = true;
+ skipAttempts = 1;
tid = &scandesc->xs_heaptid;
}
}
@@ -161,6 +185,35 @@ IndexOnlyNext(IndexOnlyScanState *node)
skipped = false;
+ /*
+ * If we have already emitted the first tuple, then while doing an
+ * index-only skip scan with advancing and reading in different directions
+ * we can return, after the visibility check, to the same position where
+ * we started. Recognize such situations and skip further.
+ */
+ if ((readDirection != direction) && node->ioss_FirstTupleEmitted &&
+ ItemPointerIsValid(&startTid) && ItemPointerEquals(&startTid, tid))
+ {
+ int i;
+ skipAttempts += 1;
+
+ for (i = 0; i < skipAttempts; i++)
+ {
+ if (!index_skip(scandesc, direction,
+ indexonlyscan->indexorderdir,
+ !node->ioss_FirstTupleEmitted,
+ node->ioss_SkipPrefixSize))
+ {
+ node->ioss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ }
+
+ tid = &scandesc->xs_heaptid;
+ }
+
+ skipped = false;
+
/*
* We can skip the heap fetch if the TID references a heap page on
* which all tuples are known visible to everybody. In any case,
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index cb2f48a1bc..58e29d054c 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -176,6 +176,7 @@ typedef bool (*amgettuple_function) (IndexScanDesc scan,
/* skip past duplicates in a given prefix */
typedef bool (*amskip_function) (IndexScanDesc scan,
ScanDirection dir,
+ ScanDirection indexdir,
int prefix);
/* fetch all valid tuples */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index d13d95c458..5a6d904af6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -183,7 +183,8 @@ extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
extern IndexBulkDeleteResult *index_vacuum_cleanup(IndexVacuumInfo *info,
IndexBulkDeleteResult *istat);
extern bool index_can_return(Relation indexRelation, int attno);
-extern bool index_skip(IndexScanDesc scan, ScanDirection direction, int prefix);
+extern bool index_skip(IndexScanDesc scan, ScanDirection direction,
+ ScanDirection indexdir, bool start, int prefix);
extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
uint16 procnum);
extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 2c516654c2..039e8d1f0d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1232,7 +1232,8 @@ extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
-extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir, int prefix);
+extern bool _bt_skip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, int prefix);
extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
Snapshot snapshot);
@@ -1257,7 +1258,8 @@ extern void _bt_end_vacuum_callback(int code, Datum arg);
extern Size BTreeShmemSize(void);
extern void BTreeShmemInit(void);
extern bytea *btoptions(Datum reloptions, bool validate);
-extern bool btskip(IndexScanDesc scan, ScanDirection dir, int prefix);
+extern bool btskip(IndexScanDesc scan, ScanDirection dir,
+ ScanDirection indexdir, int prefix);
extern bool btproperty(Oid index_oid, int attno,
IndexAMProperty prop, const char *propname,
bool *res, bool *isnull);
--
2.32.0
v41-0005-Extend-index-skip-scan-with-ScanLooseKey.patch
From 6fc50a6a405b8232a935a3d2459b62dec09c570c Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Sat, 22 Jan 2022 21:13:42 +0100
Subject: [PATCH v41 5/6] Extend index skip scan with ScanLooseKey
Index skip scan relies on information about the key prefix that needs to
be jumped over, but so far it is represented in a rather limited fashion,
only via the prefix size. This approach is sufficient for now, but for the
sake of flexibility introduce a concept of ScanLooseKey to represent
underspecified search keys. At the moment it's inspired by the idea of
skip keys constrained to some range of the keyspace, and is not used in
any way.
---
src/backend/executor/nodeIndexonlyscan.c | 14 ++++++++++----
src/include/access/relscan.h | 2 ++
src/include/access/skey.h | 8 ++++++++
src/include/nodes/execnodes.h | 5 ++++-
4 files changed, 24 insertions(+), 5 deletions(-)
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index d5abac20cb..470c364e53 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -147,12 +147,12 @@ IndexOnlyNext(IndexOnlyScanState *node)
* Due to that we skip also when the first tuple wasn't emitted yet, but
* the directions are opposite.
*/
- if (node->ioss_SkipPrefixSize > 0 &&
+ if (node->ioss_ScanLooseKeys != NULL &&
(node->ioss_FirstTupleEmitted ||
ScanDirectionsAreOpposite(direction, indexonlyscan->indexorderdir)))
{
if (!index_skip(scandesc, direction, indexonlyscan->indexorderdir,
- !node->ioss_FirstTupleEmitted, node->ioss_SkipPrefixSize))
+ !node->ioss_FirstTupleEmitted, node->ioss_NumScanLooseKeys))
{
/*
* Reached end of index. At this point currPos is invalidated, and
@@ -202,7 +202,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
if (!index_skip(scandesc, direction,
indexonlyscan->indexorderdir,
!node->ioss_FirstTupleEmitted,
- node->ioss_SkipPrefixSize))
+ node->ioss_NumScanLooseKeys))
{
node->ioss_FirstTupleEmitted = false;
return ExecClearTuple(slot);
@@ -594,7 +594,6 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
- indexstate->ioss_SkipPrefixSize = node->indexskipprefixsize;
indexstate->ioss_FirstTupleEmitted = false;
/*
@@ -697,6 +696,13 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
NULL, /* no ArrayKeys */
NULL);
+ if (node->indexskipprefixsize != 0)
+ {
+ indexstate->ioss_NumScanLooseKeys = node->indexskipprefixsize;
+ indexstate->ioss_ScanLooseKeys =
+ (ScanLooseKey) palloc(node->indexskipprefixsize * sizeof(ScanLooseKeyData));
+ }
+
/*
* If we have runtime keys, we need an ExprContext to evaluate them. The
* node's standard context won't do because we want to reset that context
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 53a93ccbe7..5200a1867d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -119,8 +119,10 @@ typedef struct IndexScanDescData
struct SnapshotData *xs_snapshot; /* snapshot to see */
int numberOfKeys; /* number of index qualifier conditions */
int numberOfOrderBys; /* number of ordering operators */
+ int numberOfLooseKeys; /* number of loose index qualifier conditions */
struct ScanKeyData *keyData; /* array of index qualifier descriptors */
struct ScanKeyData *orderByData; /* array of ordering op descriptors */
+ struct ScanLooseKeyData *looseKeyData; /* array of loose index qualifier descriptors */
bool xs_want_itup; /* caller requests index tuples */
bool xs_temp_snap; /* unregister snapshot at scan end? */
diff --git a/src/include/access/skey.h b/src/include/access/skey.h
index b5ab17f7d9..a711dc353d 100644
--- a/src/include/access/skey.h
+++ b/src/include/access/skey.h
@@ -74,6 +74,14 @@ typedef struct ScanKeyData
typedef ScanKeyData *ScanKey;
+typedef struct ScanLooseKeyData
+{
+ ScanKey start;
+ ScanKey end;
+} ScanLooseKeyData;
+
+typedef ScanLooseKeyData *ScanLooseKey;
+
/*
* About row comparisons:
*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cf36e6c0e6..fa6ee25bc6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1508,6 +1508,8 @@ typedef struct IndexScanState
* NumOrderByKeys number of OrderByKeys
* RuntimeKeys info about Skeys that must be evaluated at runtime
* NumRuntimeKeys number of RuntimeKeys
+ * ScanLooseKeys Skey structures for loose index quals
+ * NumScanLooseKeys number of ScanLooseKeys
* RuntimeKeysReady true if runtime Skeys have been computed
* RuntimeContext expr context for evaling runtime Skeys
* RelationDesc index relation descriptor
@@ -1527,13 +1529,14 @@ typedef struct IndexOnlyScanState
int ioss_NumOrderByKeys;
IndexRuntimeKeyInfo *ioss_RuntimeKeys;
int ioss_NumRuntimeKeys;
+ struct ScanLooseKeyData *ioss_ScanLooseKeys;
+ int ioss_NumScanLooseKeys;
bool ioss_RuntimeKeysReady;
ExprContext *ioss_RuntimeContext;
Relation ioss_RelationDesc;
struct IndexScanDescData *ioss_ScanDesc;
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
- int ioss_SkipPrefixSize;
bool ioss_FirstTupleEmitted;
Size ioss_PscanLen;
} IndexOnlyScanState;
--
2.32.0
v41-0006-Index-skip-scan-for-IndexScan.patch
From 7cdd4cc9621a5818c38b27f26c15c570ab4e83f6 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthalion6@gmail.com>
Date: Fri, 14 May 2021 19:22:06 +0200
Subject: [PATCH v41 6/6] Index skip scan for IndexScan
Introduce Skip Scan support for IndexScan, not only for IndexOnlyScan.
It works in the same way as for IndexOnlyScan, but the planner has to check
that the chosen index fully covers the specified distinct expressions.
Author: Jesper Pedersen, Dmitry Dolgov
Reviewed-by: Thomas Munro, David Rowley, Floris Van Nee, Kyotaro Horiguchi, Tomas Vondra, Peter Geoghegan
---
src/backend/commands/explain.c | 6 ++
src/backend/executor/nodeIndexscan.c | 56 +++++++++++++++-
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/path/indxpath.c | 59 ++++++++++++++++-
src/backend/optimizer/plan/createplan.c | 10 ++-
src/include/nodes/execnodes.h | 4 ++
src/include/nodes/plannodes.h | 2 +
src/test/regress/expected/select_distinct.out | 64 ++++++++++++++++---
src/test/regress/sql/select_distinct.sql | 16 +++++
11 files changed, 206 insertions(+), 14 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 4f5bd1d678..67d072b010 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1754,6 +1754,12 @@ ExplainNode(PlanState *planstate, List *ancestors,
switch (nodeTag(plan))
{
case T_IndexScan:
+ if (((IndexScan *) plan)->indexskipprefixsize > 0)
+ {
+ IndexScan *indexscan = (IndexScan *) plan;
+ ExplainPropertyBool("Skip scan", true, es);
+ ExplainIndexSkipScanKeys(indexscan->indexskipprefixsize, es);
+ }
show_scan_qual(((IndexScan *) plan)->indexqualorig,
"Index Cond", planstate, ancestors, es);
if (((IndexScan *) plan)->indexqualorig)
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 90b2699a96..90aad76b95 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,6 +85,13 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ IndexScan *indexscan = (IndexScan *) node->ss.ps.plan;
+
+ /*
+ * Tells whether the current position was reached via skipping. In this
+ * case there is no need for index_getnext_tid.
+ */
+ bool skipped = false;
/*
* extract necessary information from index scan node
@@ -92,7 +99,7 @@ IndexNext(IndexScanState *node)
estate = node->ss.ps.state;
direction = estate->es_direction;
/* flip direction if this is an overall backward scan */
- if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
+ if (ScanDirectionIsBackward(indexscan->indexorderdir))
{
if (ScanDirectionIsForward(direction))
direction = BackwardScanDirection;
@@ -117,6 +124,12 @@ IndexNext(IndexScanState *node)
node->iss_ScanDesc = scandesc;
+ /* Index skip scan assumes xs_want_itup, so set it to true */
+ if (indexscan->indexskipprefixsize > 0)
+ node->iss_ScanDesc->xs_want_itup = true;
+ else
+ node->iss_ScanDesc->xs_want_itup = false;
+
/*
* If no run-time keys to calculate or they are ready, go ahead and
* pass the scankeys to the index AM.
@@ -127,12 +140,48 @@ IndexNext(IndexScanState *node)
node->iss_OrderByKeys, node->iss_NumOrderByKeys);
}
+ /*
+ * Check if we need to skip to the next key prefix, because we've been
+ * asked to implement DISTINCT.
+ *
+ * When fetching a cursor in the direction opposite to a general scan
+ * direction, the result must be what normal fetching should have
+ * returned, but in reversed order. In other words, return the last or
+ * first scanned tuple in a DISTINCT set, depending on a cursor direction.
+ * Due to that we skip also when the first tuple wasn't emitted yet, but
+ * the directions are opposite.
+ */
+ if (node->iss_SkipPrefixSize > 0 &&
+ (node->iss_FirstTupleEmitted ||
+ ScanDirectionsAreOpposite(direction, indexscan->indexorderdir)))
+ {
+ if (!index_skip(scandesc, direction, indexscan->indexorderdir,
+ !node->iss_FirstTupleEmitted, node->iss_SkipPrefixSize))
+ {
+ /*
+ * Reached end of index. At this point currPos is invalidated, and
+ * we need to reset iss_FirstTupleEmitted, since otherwise after
+ * going backwards, reaching the end of index, and going forward
+ * again we apply skip again. It would be incorrect and lead to an
+ * extra skipped item.
+ */
+ node->iss_FirstTupleEmitted = false;
+ return ExecClearTuple(slot);
+ }
+ else
+ {
+ skipped = true;
+ index_fetch_heap(scandesc, slot);
+ }
+ }
+
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while (skipped || index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
+ skipped = false;
/*
* If the index was lossy, we have to recheck the index quals using
@@ -149,6 +198,7 @@ IndexNext(IndexScanState *node)
}
}
+ node->iss_FirstTupleEmitted = true;
return slot;
}
@@ -914,6 +964,8 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexScan;
+ indexstate->iss_SkipPrefixSize = node->indexskipprefixsize;
+ indexstate->iss_FirstTupleEmitted = false;
/*
* Miscellaneous initialization
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index fe0d92ad46..8d0f92493f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -497,6 +497,7 @@ _copyIndexScan(const IndexScan *from)
COPY_NODE_FIELD(indexorderbyorig);
COPY_NODE_FIELD(indexorderbyops);
COPY_SCALAR_FIELD(indexorderdir);
+ COPY_SCALAR_FIELD(indexskipprefixsize);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 06eb1a89e9..ad5c223bf8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -569,6 +569,7 @@ _outIndexScan(StringInfo str, const IndexScan *node)
WRITE_NODE_FIELD(indexorderbyorig);
WRITE_NODE_FIELD(indexorderbyops);
WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+ WRITE_INT_FIELD(indexskipprefixsize);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 169fb408d3..0fefbd5cc8 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1868,6 +1868,7 @@ _readIndexScan(void)
READ_NODE_FIELD(indexorderbyorig);
READ_NODE_FIELD(indexorderbyops);
READ_ENUM_FIELD(indexorderdir, ScanDirection);
+ READ_INT_FIELD(indexskipprefixsize);
READ_DONE();
}
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 00526f3476..41444d2216 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -1036,7 +1036,7 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
check_index_only(rel, index));
/* Check if an index skip scan is possible. */
- can_skip = enable_indexskipscan & index->amcanskip & index_only_scan;
+ can_skip = enable_indexskipscan & index->amcanskip;
if (can_skip)
{
@@ -1099,6 +1099,63 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
}
}
}
+
+ /*
+ * For an index scan, verify that the index fully covers the distinct
+ * expressions; otherwise there is not enough information for skipping.
+ */
+ if (!index_only_scan && root->query_uniquekeys != NULL)
+ {
+ ListCell *lc;
+
+ foreach(lc, root->query_uniquekeys)
+ {
+ List *uniqExprs = (List *) lfirst(lc);
+ ListCell *lc1;
+
+ foreach(lc1, uniqExprs)
+ {
+ Expr *expr = (Expr *) lfirst(lc1);
+ bool found = false;
+
+ if (!IsA(expr, Var))
+ {
+ ListCell *lc2;
+
+ foreach(lc2, index->indexprs)
+ {
+ if(equal(lfirst(lc1), lfirst(lc2)))
+ {
+ found = true;
+ break;
+ }
+ }
+ }
+ else
+ {
+ Var *var = (Var *) expr;
+
+ for (int i = 0; i < index->ncolumns; i++)
+ {
+ if (index->indexkeys[i] == var->varattno)
+ {
+ found = true;
+ break;
+ }
+ }
+ }
+
+ if (!found)
+ {
+ can_skip = false;
+ break;
+ }
+ }
+
+ if (!can_skip)
+ break;
+ }
+ }
}
/*
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 511dda3a9f..618d1833ce 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -185,7 +185,8 @@ static IndexScan *make_indexscan(List *qptlist, List *qpqual, Index scanrelid,
Oid indexid, List *indexqual, List *indexqualorig,
List *indexorderby, List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir);
+ ScanDirection indexscandir,
+ int skipprefix);
static IndexOnlyScan *make_indexonlyscan(List *qptlist, List *qpqual,
Index scanrelid, Oid indexid,
List *indexqual, List *recheckqual,
@@ -3121,7 +3122,8 @@ create_indexscan_plan(PlannerInfo *root,
fixed_indexorderbys,
indexorderbys,
indexorderbyops,
- best_path->indexscandir);
+ best_path->indexscandir,
+ best_path->indexskipprefix);
copy_generic_path_info(&scan_plan->plan, &best_path->path);
@@ -5454,7 +5456,8 @@ make_indexscan(List *qptlist,
List *indexorderby,
List *indexorderbyorig,
List *indexorderbyops,
- ScanDirection indexscandir)
+ ScanDirection indexscandir,
+ int skipPrefixSize)
{
IndexScan *node = makeNode(IndexScan);
Plan *plan = &node->scan.plan;
@@ -5471,6 +5474,7 @@ make_indexscan(List *qptlist,
node->indexorderbyorig = indexorderbyorig;
node->indexorderbyops = indexorderbyops;
node->indexorderdir = indexscandir;
+ node->indexskipprefixsize = skipPrefixSize;
return node;
}
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fa6ee25bc6..0cf02c1cd7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1486,6 +1486,8 @@ typedef struct IndexScanState
ExprContext *iss_RuntimeContext;
Relation iss_RelationDesc;
struct IndexScanDescData *iss_ScanDesc;
+ int iss_SkipPrefixSize;
+ bool iss_FirstTupleEmitted;
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
@@ -1517,6 +1519,8 @@ typedef struct IndexScanState
* TableSlot slot for holding tuples fetched from the table
* VMBuffer buffer in use for visibility map testing, if any
* PscanLen size of parallel index-only scan descriptor
+ * SkipPrefixSize number of keys for skip-based DISTINCT
+ * FirstTupleEmitted has the first tuple been emitted
* ----------------
*/
typedef struct IndexOnlyScanState
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 6b3eefebc6..c0c91d4d09 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -411,6 +411,8 @@ typedef struct IndexScan
List *indexorderbyorig; /* the same in original form */
List *indexorderbyops; /* OIDs of sort ops for ORDER BY exprs */
ScanDirection indexorderdir; /* forward or backward or don't care */
+ int indexskipprefixsize; /* the size of the prefix for distinct
+ * scans */
} IndexScan;
/* ----------------
diff --git a/src/test/regress/expected/select_distinct.out b/src/test/regress/expected/select_distinct.out
index 36b3291a7f..b8f747d8d2 100644
--- a/src/test/regress/expected/select_distinct.out
+++ b/src/test/regress/expected/select_distinct.out
@@ -445,14 +445,12 @@ SELECT DISTINCT ON (a) a, b FROM distinct_a ORDER BY a DESC, b DESC;
-- test index skip scan for expressions
EXPLAIN (COSTS OFF)
SELECT DISTINCT (a + 1) FROM distinct_a ORDER BY (a + 1);
- QUERY PLAN
-------------------------------------
- Sort
- Sort Key: ((a + 1))
- -> HashAggregate
- Group Key: (a + 1)
- -> Seq Scan on distinct_a
-(5 rows)
+ QUERY PLAN
+----------------------------------------------------
+ Index Scan using distinct_a_expr_idx on distinct_a
+ Skip scan: true
+ Distinct Prefix: 1
+(3 rows)
SELECT DISTINCT (a + 1) FROM distinct_a ORDER BY (a + 1);
?column?
@@ -693,6 +691,56 @@ FETCH BACKWARD ALL FROM c;
END;
DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+ 2 | 1 | 10
+ 3 | 1 | 10
+ 4 | 1 | 10
+ 5 | 1 | 10
+(5 rows)
+
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ a | b | c
+---+---+----
+ 1 | 1 | 10
+(1 row)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+ Distinct Prefix: 1
+(3 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+ QUERY PLAN
+---------------------------------------------------
+ Index Scan using distinct_a_a_b_idx on distinct_a
+ Skip scan: true
+ Distinct Prefix: 1
+ Index Cond: (a = 1)
+(4 rows)
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT *
+FROM distinct_a;
+ QUERY PLAN
+------------------------------
+ HashAggregate
+ Group Key: a, b, c
+ -> Seq Scan on distinct_a
+(3 rows)
+
-- check colums order
SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
a
diff --git a/src/test/regress/sql/select_distinct.sql b/src/test/regress/sql/select_distinct.sql
index c9ccf4cc7d..b0d2ee7066 100644
--- a/src/test/regress/sql/select_distinct.sql
+++ b/src/test/regress/sql/select_distinct.sql
@@ -294,6 +294,22 @@ END;
DROP TABLE distinct_abc;
+-- index skip scan
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT ON (a) a, b, c
+FROM distinct_a WHERE a = 1 ORDER BY a;
+EXPLAIN (COSTS OFF)
+SELECT DISTINCT *
+FROM distinct_a;
+
-- check colums order
SELECT DISTINCT a FROM distinct_a WHERE b = 2 AND c = 10;
--
2.32.0
On Tue, Mar 22, 2022 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:
IMO it's pretty clear that having "duelling" patches below one CF entry is a
bad idea. I think they should be split, with inactive approaches marked as
returned with feedback or whatnot.
I have the impression that this thread is getting some value from
having a CF entry, as a multi-person collaboration where people are
trading ideas and also making progress that no one wants to mark as
returned, but it's vexing for people managing the CF because it's not
really proposed for 15. Perhaps what we lack is a new status, "Work
In Progress" or something?
Peter Geoghegan <pg@bowt.ie> writes:
Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things.
Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here. I have done that, and here
are a few modest suggestions.
1. Usually I'm in favor of doing this sort of thing in an index AM
agnostic way, but here I don't see much point. All of the ideas at
stake rely fundamentally on having a lexicographically-ordered multi
column index; but we don't have any of those except btree, nor do
I think we're likely to get any soon. This motivates the general
tenor of my remarks below, which is "do it in access/nbtree/ not in
the planner".
2. The MDAM paper Peter cited is really interesting. You can see
fragments of those ideas in our existing btree code, particularly in
the scan setup stuff that detects redundant or contradictory keys and
determines a scan start strategy. The special handling we implemented
awhile ago for ScalarArrayOp index quals is also a subset of what they
were talking about. It seems to me that if we wanted to implement more
of those ideas, the relevant work should almost all be done in nbtree
proper. The planner would need only minor adjustments: btcostestimate
would have to be fixed to understand the improvements, and there are
some rules in indxpath.c that prevent us from passing "too complicated"
sets of indexquals to the AM, which would need to be relaxed or removed
altogether.
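To make the kind of quals at stake concrete, here is a rough sketch --
illustrative only, with an invented table t and an assumed index on (a, b):
-- hypothetical example; assumes CREATE INDEX ON t (a, b)
SELECT * FROM t
WHERE a BETWEEN 10 AND 20
  AND a > 5               -- redundant key, detectable at scan setup
  AND b IN (1, 2, 3);     -- ScalarArrayOp qual, already handled specially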
3. "Loose" indexscan (i.e., sometimes re-descend from the tree root
to find the next index entry) is again something that seems like it's
mainly nbtree's internal problem. Loose scan is interesting if we
have index quals for columns that are after the first column that lacks
an equality qual, otherwise not. I've worried in the past that we'd
need planner/statistical support to figure out whether a loose scan
is likely to be useful compared to just plowing ahead in the index.
However, that seems to be rendered moot by the idea used in the current
patchsets, ie scan till we find that we'll have to step off the current
page, and re-descend at that point. (When and if we find that that
heuristic is inadequate, we could work on passing some statistical data
forward. But we don't need any in the v1 patch.) Again, we need some
work in btcostestimate to understand how the index scan cost will be
affected, but I still don't see any pressing need for major planner
changes or plan tree contents changes.
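As a hedged illustration of the case where re-descending pays off (names
invented, not taken from the patch):
-- hypothetical example; assumes CREATE INDEX ON t (a, b, c)
-- equality qual on a, nothing on b, qual on c
SELECT * FROM t WHERE a = 1 AND c = 42;
-- a loose scan could re-descend from the root once per distinct value
-- of b within a = 1, instead of plowing through every a = 1 entry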
4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case. I wouldn't even bother with that for the
initial patch. When we do get around to it, it still doesn't need
major planner support, I think --- again fixing the cost estimation
is the bulk of the work. Munro's original 2014 patch showed that we
don't really need all that much to get the planner to build such a
plan; the problem is to convince it that that plan will be cheap.
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value. The part of the planner that actually needs
serious thought is btcostestimate, which has been woefully neglected in
both of the current patchsets.
BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start. This issue might become more
pressing if the work becomes significantly more complicated/expensive,
which these ideas might cause. But it's a refinement that could be
left for later --- and in any case, the responsibility would still
fundamentally be nbtree's. I don't think the planner would do more
than call some AM routine that could add decoration to an IndexScan
plan node.
Now ... where did I put my flameproof vest?
regards, tom lane
Hi,
On 2022-03-22 16:55:49 -0400, Tom Lane wrote:
4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case.
It's something causing plenty of pain in production environments... Obviously
it'd be even better if the optimization also triggered in cases like
SELECT some_indexed_col FROM blarg GROUP BY some_indexed_col
which seems to be what ORMs like to generate.
BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start.
It does show up in simple-index-lookup heavy workloads. Not as a major thing,
but it's there. And it's just architecturally displeasing :)
Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?
The latter does show up a bit more heavily in profiles than nbtree specific
setup, and given that it's generic executor type stuff, seems even more
amenable to being moved to plan time.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2022-03-22 16:55:49 -0400, Tom Lane wrote:
BTW, I've had a bee in my bonnet for a long time about whether some of
nbtree's scan setup work could be done once during planning, rather than
over again during each indexscan start.
It does show up in simple-index-lookup heavy workloads. Not as a major thing,
but it's there. And it's just architecturally displeasing :)
Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?
Didn't really have specifics in mind. The key stumbling block is
that some (not all) of the work depends on knowing the specific
values of the indexqual comparison keys, so while you could do
that work in advance for constant keys, you'd still have to be
prepared to do work at scan start for non-constant keys. I don't
have a clear idea about how to factorize that effectively.
A couple of other random ideas in this space:
* I suspect that a lot of this work overlaps with the efforts that
btcostestimate makes along the way to getting a cost estimate.
So it's interesting to wonder whether we could refactor so that
btcostestimate is integrated with this hypothetical plan-time key
preprocessing and doesn't duplicate work.
* I think that we run through most or all of that preprocessing
logic even for internal catalog accesses, where we know darn well
how the keys are set up. We ought to think harder about how we
could short-circuit pointless work in those code paths.
I don't think any of this is an essential prerequisite to getting
something done for loose index scans, which ISTM ought to be the first
point of attack for v16. Loose index scans per se shouldn't add much
to the key preprocessing costs. But these ideas likely would be
useful to look into before anyone starts on the more complicated
preprocessing that would be needed for the ideas in the MDAM paper.
regards, tom lane
On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
Peter Geoghegan <pg@bowt.ie> writes:
Like many difficult patches, the skip scan patch is not so much
troubled by problems with the implementation as it is troubled by
*ambiguity* about the design. Particularly concerning how skip scan
meshes with existing designs, as well as future designs --
particularly designs for other MDAM techniques. I've started this
thread to have a big picture conversation about how to think about
these things.
Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here. I have done that, and here
are a few modest suggestions.
Thanks. To make sure I understand your proposal better, I have a couple
of questions:
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.
Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seem to be essential) -- is that what you mean?
Dmitry Dolgov <9erthalion6@gmail.com> writes:
On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.
Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seems to be essential) -- is that what you mean?
My point is that for pure loose scans --- that is, just optimizing a scan,
not doing AM-based duplicate-row-elimination --- you do not need to pass
any new data to btree at all. It can infer what to do on the basis of the
set of index quals it's handed.
The bigger picture here is that I think the reason this patch series has
failed to progress is that it's too scattershot. You need to pick a
minimum committable feature and get that done, and then you can move on
to the next part. I think the minimum committable feature is loose scans,
which will require a fair amount of work in access/nbtree/ but very little
new planner code, and will be highly useful in their own right even if we
never do anything more.
In general I feel that the UniqueKey code is a solution looking for a
problem, and that treating it as the core of the patchset is a mistake.
We should be driving this work off of what nbtree needs to make progress,
and not building more infrastructure elsewhere than we have to. Maybe
we'll end up with something that looks like UniqueKeys, but I'm far from
convinced of that.
regards, tom lane
On Wed, Mar 23, 2022 at 05:32:46PM -0400, Tom Lane wrote:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
On Tue, Mar 22, 2022 at 04:55:49PM -0400, Tom Lane wrote:
In short: I would throw out just about all the planner infrastructure
that's been proposed so far. It looks bulky, expensive, and
drastically undercommented, and I don't think it's buying us anything
of commensurate value.
Broadly speaking planner related changes proposed in the patch so far
are: UniqueKey, taken from the neighbour thread about select distinct;
list of uniquekeys to actually pass information about the specified
loose scan prefix into nbtree; some verification logic to prevent
applying skipping when it's not supported. I can imagine taking out
UniqueKeys and passing loose scan prefix in some other form (the other
parts seem to be essential) -- is that what you mean?
My point is that for pure loose scans --- that is, just optimizing a scan,
not doing AM-based duplicate-row-elimination --- you do not need to pass
any new data to btree at all. It can infer what to do on the basis of the
set of index quals it's handed.
The bigger picture here is that I think the reason this patch series has
failed to progress is that it's too scattershot. You need to pick a
minimum committable feature and get that done, and then you can move on
to the next part. I think the minimum committable feature is loose scans,
which will require a fair amount of work in access/nbtree/ but very little
new planner code, and will be highly useful in their own right even if we
never do anything more.
In general I feel that the UniqueKey code is a solution looking for a
problem, and that treating it as the core of the patchset is a mistake.
We should be driving this work off of what nbtree needs to make progress,
and not building more infrastructure elsewhere than we have to. Maybe
we'll end up with something that looks like UniqueKeys, but I'm far from
convinced of that.
I see. I'll need some thinking time about how it may look like (will
probably return with more questions).
The CF item could be set to RwF, what would you say, Jesper?
On 3/23/22 18:22, Dmitry Dolgov wrote:
The CF item could be set to RwF, what would you say, Jesper?
We want to thank the community for the feedback that we have received
over the years for this feature. Hopefully a future implementation can
use Tom's suggestions to get closer to a committable solution.
Here is the last CommitFest entry [1]https://commitfest.postgresql.org/37/1741/ for the archives.
RwF
[1]: https://commitfest.postgresql.org/37/1741/
Best regards,
Dmitry & Jesper
On Tue, Mar 22, 2022 at 4:06 PM Andres Freund <andres@anarazel.de> wrote:
Are you thinking of just moving the setup stuff in nbtree (presumably parts of
_bt_first() / _bt_preprocess_keys()) or also stuff in
ExecIndexBuildScanKeys()?The latter does show up a bit more heavily in profiles than nbtree specific
setup, and given that it's generic executor type stuff, seems even more
amenable to being moved to plan time.
When I was working on the patch series that became the nbtree Postgres
12 work, this came up. At one point I discovered that using palloc0()
for the insertion scankey in _bt_first() was a big problem with nested
loop joins -- it became a really noticeable bottleneck with one of my
test cases. I independently discovered what Tom must have figured out
back in 2005, when he committed d961a56899. This led to my making the
new insertion scan key structure (BTScanInsertData) not use dynamic
allocation. So _bt_first() is definitely performance critical for
certain types of queries.
We could get rid of dynamic allocations for BTStackData in
_bt_first(), perhaps. The problem is that there is no simple,
reasonable proof of the maximum height on a B-tree, even though a
B-Tree with more than 7 or 8 levels seems extraordinarily unlikely.
You could also invent a slow path (maybe do what we do in
_bt_insert_parent() in the event of a concurrent root page split/NULL
stack), but that runs into the problem of being awkward to test, and
pretty ugly. It's doable, but I wouldn't do it unless there was a
pretty noticeable payoff.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
We could get rid of dynamic allocations for BTStackData in
_bt_first(), perhaps. The problem is that there is no simple,
reasonable proof of the maximum height on a B-tree, even though a
B-Tree with more than 7 or 8 levels seems extraordinarily unlikely.
Start with a few entries preallocated, and switch to dynamically
allocated space if there turn out to be more levels than that,
perhaps? Not sure if it's worth the trouble.
In any case, what I was on about is _bt_preprocess_keys() and
adjacent code. I'm surprised that those aren't more expensive
than one palloc in _bt_first. Maybe that logic falls through very
quickly in simple cases, though.
regards, tom lane
On Tue, Mar 22, 2022 at 1:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter asked me off-list to spend some time thinking about the overall
direction we ought to be pursuing here.
Thanks for taking a look!
"5.5 Exploiting Key Prefixes" and "5.6 Ordered Retrieval" from "Modern
B-Tree Techniques" are also good, BTW.
The terminology in this area is a mess. MySQL calls
SELECT-DISTINCT-that-matches-an-index "loose index scans". I think
that you're talking about skip scan when you say "loose index scan".
Skip scan is where there is an omitted prefix of columns in the SQL
query -- omitted columns after the first column that lack an equality
qual. Pretty sure that MySQL/InnoDB can't do that -- it can only
"skip" to the extent required to make
SELECT-DISTINCT-that-matches-an-index perform well, but that's about
it.
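To keep the two terms apart with some concrete shorthand (illustrative SQL
only, names invented):
-- assumes CREATE INDEX ON t (x, y)
-- MySQL-style "loose index scan": DISTINCT that matches the index
SELECT DISTINCT x FROM t;
-- skip scan: the leading column x is omitted from the quals entirely
SELECT * FROM t WHERE y = 42;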
It might be useful for somebody to go write a "taxonomy of MDAM
techniques", or a glossary. The existing "Loose indexscan" Postgres
wiki page doesn't seem like enough. Something very high level and
explicit, with examples, just so we don't end up talking at cross
purposes too much.
1. Usually I'm in favor of doing this sort of thing in an index AM
agnostic way, but here I don't see much point. All of the ideas at
stake rely fundamentally on having a lexicographically-ordered multi
column index; but we don't have any of those except btree, nor do
I think we're likely to get any soon. This motivates the general
tenor of my remarks below, which is "do it in access/nbtree/ not in
the planner".
That was my intuition all along, but I didn't quite have the courage
to say so -- sounds too much like something that an optimizer
dilettante like me would be expected to say. :-)
Seems like one of those things where lots of high level details
intrinsically need to be figured out on-the-fly, at execution time,
rather than during planning. Perhaps it'll be easier to correctly
determine that a skip scan plan is the cheapest in practice than to
accurately cost skip scan plans. If the only alternative is a
sequential scan, then perhaps a very approximate cost model will work
well enough. It's probably way too early to tell right now, though.
I've worried in the past that we'd
need planner/statistical support to figure out whether a loose scan
is likely to be useful compared to just plowing ahead in the index.
I don't expect to be able to come up with a structure that leaves no
unanswered questions about future MDAM work -- it's not realistic to
expect everything to just fall into place. But that's okay. Just
having everybody agree on roughly the right conceptual model is the
really important thing. That now seems quite close, which I count as
real progress.
4. I find each of the above ideas to be far more attractive than
optimizing SELECT-DISTINCT-that-matches-an-index, so I don't really
understand why the current patchsets seem to be driven largely
by that single use-case. I wouldn't even bother with that for the
initial patch.
I absolutely agree. I wondered about that myself in the past. My best
guess is that a certain segment of users are familiar with
SELECT-DISTINCT-that-matches-an-index from MySQL. And so to some
extent application frameworks evolved in a world where that capability
existed. IIRC Jesper once said that Hibernate relied on this
capability.
It's probably a lot easier to implement
SELECT-DISTINCT-that-matches-an-index if you have the MySQL storage
engine model, with concurrency control that's typically based on
two-phase locking. I think that MySQL does some amount of
deduplication in its executor here -- and *not* in what they call the storage
engine. This is exactly what I'd like to avoid in Postgres; as I said
"Maintenance of Index Order" (as the paper calls it) seems important,
and not something to be added later on. Optimizer and executor layers
that each barely know the difference between a skip scan and a full
index scan seems like something we might actually want to aim for,
rather than avoid. Teaching nbtree to transform quals into ranges
sounds odd at first, but it seems like the right approach now, on
balance -- that's the only *good* way to maintain index order.
(Maintaining index order is needed to avoid needing or relying on
deduplication in the executor proper, which is even inappropriate in
an implementation of SELECT-DISTINCT-that-matches-an-index IMO.)
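A rough sketch of that kind of qual-to-ranges transformation (illustrative
only; nothing here is taken from the patch):
-- assumes CREATE INDEX ON t (a, b)
SELECT a, b FROM t WHERE a IN (1, 3, 5) AND b = 7;
-- inside nbtree this could be driven as three consecutive ranges,
-- visited in index order: (a = 1, b = 7), (a = 3, b = 7), (a = 5, b = 7)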
--
Peter Geoghegan
On Mon, Mar 28, 2022 at 5:21 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
In any case, what I was on about is _bt_preprocess_keys() and
adjacent code. I'm surprised that those aren't more expensive
than one palloc in _bt_first. Maybe that logic falls through very
quickly in simple cases, though.
I assume that it doesn't really appear in very simple cases (also
common cases). But delaying the scan setup work until execution time
does seem ugly. That's probably a good enough reason to refactor.
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
The terminology in this area is a mess. MySQL calls
SELECT-DISTINCT-that-matches-an-index "loose index scans". I think
that you're talking about skip scan when you say "loose index scan".
Skip scan is where there is an omitted prefix of columns in the SQL
query -- omitted columns after the first column that lack an equality
qual.
Right, that's the case I had in mind --- apologies if my terminology
was faulty. btree can actually handle such a case now, but what it
fails to do is re-descend from the tree root instead of plowing
forward in the index to find the next matching entry.
It might be useful for somebody to go write a "taxonomy of MDAM
techniques", or a glossary.
+1. We at least need to be sure we all are using these terms
the same way.
regards, tom lane
On Mon, Mar 28, 2022 at 7:07 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Right, that's the case I had in mind --- apologies if my terminology
was faulty. btree can actually handle such a case now, but what it
fails to do is re-descend from the tree root instead of plowing
forward in the index to find the next matching entry.
KNNGIST seems vaguely related to what we'd build for nbtree skip scan.
GiST index scans are "inherently loose", though. KNNGIST uses
a pairing heap/priority queue, which seems like the kind of thing
nbtree skip scan can avoid.
+1. We at least need to be sure we all are using these terms
the same way.
Yeah, there are *endless* opportunities for confusion here.
--
Peter Geoghegan